Abstract

The paper offers a novel unified approach to studying the accuracy of parameter
estimation by the quasi likelihood method. Important features of the approach are:
(1) The underlying model is not assumed to be parametric. (2) No conditions on
parameter identifiability are required. The parameter set can be unbounded.
(3) The model assumptions are quite general and there are no specific structural
assumptions like independence or weak dependence of observations. The imposed
conditions on the model are very mild and can be easily checked in specific
applications. (4) The established risk bounds are nonasymptotic and valid for
large, moderate and small samples. (5) The main result is the concentration
property of the quasi MLE giving a nonasymptotic exponential bound for the
probability that the considered estimate deviates out of a small neighborhood of
the "true" point.

In standard situations under mild regularity conditions, the usual consistency
and rate results can be easily obtained as corollaries from the established risk
bounds. The approach and the results are illustrated on the example of generalized
linear and single-index models.

JEL codes: C13,C22. 
1 Introduction

One of the most popular approaches in statistics is based on the parametric assumption
that the distribution $P$ of the observed data $Y$ belongs to a given parametric family
$(P_\theta, \theta \in \Theta)$, where $\Theta$ is a subset of a finite dimensional space $\mathbb{R}^p$. In this
situation, the statistical inference about $P$ is reduced to recovering $\theta$. The standard
likelihood principle suggests estimating $\theta$ by maximizing the corresponding log-likelihood
function $L(Y,\theta)$. The classical parametric statistical theory focuses mostly on asymptotic
properties of the difference between the estimate $\tilde\theta$ and the true value $\theta_0$ as the sample
size $n$ tends to infinity. There is a vast literature on this issue. We only mention the book by
Ibragimov and Khas'minskij (1981), which provides a comprehensive study of asymptotic
properties of maximum likelihood and Bayesian estimators. The related analysis is effectively
based on the Taylor expansion of the likelihood function near the true point under the
assumption that the considered estimate concentrates well in a small (root-n) neighborhood
of this point. By contrast, there are only a few results which establish this desired
concentration property. Ibragimov and Khas'minskij (1981), Section 1.5, present some
exponential concentration bounds in the i.i.d. parametric case. Large deviation results
about minimum contrast estimators can be found in Jensen and Wood (1998) and Sieders
and Dzhaparidze (1987), while subtle small sample size properties of these estimators are
presented in Field (1982) and Field and Ronchetti (1990). This paper aims at studying
the concentration properties of a general parametric estimate. The main result describes
some concentration sets for the considered estimate and establishes an exponential bound
for the probability that the estimate deviates out of such sets.

In the modern statistical literature there are a number of papers considering maximum
likelihood or more generally minimum contrast estimators in a general i.i.d. situation,
when the parameter set is a subset of some functional space. We mention the papers
Van de Geer (1993), Birge and Massart (1993), Birge and Massart (1998), Birge (2006)
and references therein. These studies mostly focus on the concentration properties of
the maximum over $\theta \in \Theta$ of the log-likelihood $L(Y,\theta)$ rather than on the properties of
the estimator $\tilde\theta$, which is the point of maximum of $L(Y,\theta)$. The established results are
based on deep probabilistic facts from the empirical process theory (see e.g. Talagrand
(1996), van der Vaart and Wellner (1996), Boucheron et al. (2003)). Our approach is
similar in the sense that the analysis also focuses on the properties of the maximum of
$L(Y,\theta)$ over $\theta \in \Theta$. However, we do not assume any specific structure of the model.
In particular, we do not assume independent observations and thus cannot apply the
methods from the empirical process theory.

The aim of this paper is to offer a general and unified approach to the statistical
estimation problem which delivers meaningful and informative results in a general framework
under mild regularity assumptions. An important feature of the proposed approach is that
it allows us to go beyond the parametric case, that is, most of the results and conclusions
continue to apply even if the parametric assumption is not precisely fulfilled. Then the
target of estimation can be viewed as the best parametric fit. Some other important
features of the proposed approach are that the established risk bounds are nonasymptotic
and equally apply to large, moderate and small samples and that the results describe
nonasymptotic confidence and concentration sets in terms of the quasi log-likelihood rather
than the accuracy of point estimation. In most of the examples, the usual consistency
and rate results can be easily obtained as corollaries from the established risk bounds.
The results are obtained under very mild conditions which are easy to verify in particular
applications. There are no specific assumptions on the structure of the data like
independence or weak dependence of observations, and the parameter set can be unbounded.
Another interesting feature of the proposed approach is that it does not require any
identifiability conditions. Informally, one can say that whatever the quasi likelihood or
contrast is, the corresponding estimate belongs with a dominating probability to the
corresponding concentration set. Examples show that the resulting concentration sets are of
the right magnitude; in typical situations this is a root-n vicinity of the true point.

Now we specify the considered set-up. Let $Y$ stand for the observed data. For
notational simplicity we assume that $Y$ is a vector in $\mathbb{R}^n$. By $P$ we denote the
measure describing the distribution of the whole sample $Y$. The parametric approach
discussed below allows us to reduce the whole description of the model to a few parameters
which have to be estimated from the data. Let $(P_\theta, \theta \in \Theta)$ be a given parametric
family of measures on $\mathbb{R}^n$. The parametric assumption means simply that $P = P_{\theta_0}$ for
some $\theta_0 \in \Theta$. The parameter vector $\theta_0$ can be estimated using the maximum likelihood
(MLE) approach. Let $L(Y,\theta)$ be the log-likelihood for the considered parametric model:
$L(Y,\theta) = \log \frac{dP_\theta}{dP_0}(Y)$, where $P_0$ is any dominating measure for the family $(P_\theta)$. The
MLE $\tilde\theta$ of the parameter $\theta_0$ is given by maximizing the log-likelihood $L(Y,\theta)$:

$$\tilde\theta = \operatorname*{argmax}_{\theta \in \Theta} L(Y,\theta). \tag{1.1}$$

Note that the value of the estimate will not be changed if the process $L(Y,\theta)$ is
multiplied by any positive constant $\mu$.

The quasi maximum likelihood approach admits that the underlying distribution $P$
does not belong to the family $(P_\theta)$. The estimate $\tilde\theta$ from (1.1) is still meaningful and
it becomes the quasi MLE. Later we show that $\tilde\theta$ estimates the value $\theta_0$ defined by
maximizing the expected value of $L(Y,\theta)$:

$$\theta_0 \stackrel{\mathrm{def}}{=} \operatorname*{argmax}_{\theta \in \Theta} E L(Y,\theta),$$

which is the true value in the parametric situation and can be viewed as the parameter
of the best parametric fit in the general case.
Note that the presented set-up is quite general and most statistical estimation
procedures can be represented as quasi maximum likelihood for a properly selected
parametric family. In particular, the popular least squares, least absolute deviations
and M-estimates can be represented as quasi MLE.
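The remark that least squares and least absolute deviations are quasi MLEs can be checked numerically. The sketch below is an illustration only: the sample, the grid and the helper name `quasi_mle` are assumptions, not part of the paper. It maximizes a Gaussian and a Laplace quasi log-likelihood for a location parameter over a grid and compares the maximizers with the sample mean and the sample median.

```python
import statistics


def quasi_mle(data, loglik, grid):
    """Return the grid point maximizing the quasi log-likelihood sum_i loglik(y_i, theta)."""
    return max(grid, key=lambda theta: sum(loglik(y, theta) for y in data))


def gauss_loglik(y, theta):
    """Gaussian log-density in the shift parameter, up to an additive constant."""
    return -0.5 * (y - theta) ** 2


def laplace_loglik(y, theta):
    """Laplace log-density in the shift parameter, up to an additive constant."""
    return -abs(y - theta)


# Hypothetical sample; it is not assumed to follow either parametric family.
data = [0.3, 1.1, 1.4, 2.0, 2.2, 2.9, 9.6]
grid = [k / 100 for k in range(0, 1001)]  # candidate values 0.00, 0.01, ..., 10.00

theta_ls = quasi_mle(data, gauss_loglik, grid)     # least squares estimate
theta_lad = quasi_mle(data, laplace_loglik, grid)  # least absolute deviations estimate

print("Gaussian quasi MLE:", theta_ls, "sample mean:", statistics.mean(data))
print("Laplace quasi MLE:", theta_lad, "sample median:", statistics.median(data))
```

The grid search is chosen only for transparency; neither estimate requires the data to be Gaussian or Laplace, which is the point of the quasi likelihood view.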

The set-up of this paper is even more general. Namely, we consider a general estimate
$\tilde\theta$ defined by maximizing a random field $\mathcal{L}(\theta)$. The basic example we have in mind is the
scaled quasi log-likelihood $\mathcal{L}(\theta) = \mu L(Y,\theta)$ for some $\mu > 0$. In some cases, especially
if the parameter set is unbounded, the scaling factor $\mu$ can also be taken depending on
$\theta$, that is, $\mathcal{L}(\theta) = \mu(\theta) L(Y,\theta)$. We focus on the properties of the process $\mathcal{L}(\theta)$ as
a function of the parameter $\theta$. Therefore, we suppress the argument $Y$ there. One
has to keep in mind that $\mathcal{L}(\theta)$ is random and depends on the observed data $Y$. The
study focuses on the concentration properties of the estimate $\tilde\theta$ which is defined by
maximization of the random process $\mathcal{L}(\theta)$. Let

$$\theta_0 = \operatorname*{argmax}_{\theta \in \Theta} E \mathcal{L}(\theta).$$

We also define $\mathcal{L}(\theta,\theta_0) = \mathcal{L}(\theta) - \mathcal{L}(\theta_0)$. The aim of our study is to bound the value of the
quasi maximum likelihood $\mathcal{L}(\tilde\theta,\theta_0) = \max_\theta \mathcal{L}(\theta,\theta_0)$. The basic assumption imposed on
the process $\mathcal{L}(\theta)$ is that the difference $\mathcal{L}(\theta,\theta_0) = \mathcal{L}(\theta) - \mathcal{L}(\theta_0)$ has bounded exponential
moments for every $\theta$. Our primary goal is to bound the supremum of such differences, or
more precisely, to establish an exponential bound for the value $\mathcal{L}(\tilde\theta,\theta_0)$. The standard
approach of empirical process theory is to consider separately the mean and the centered
stochastic deviations of the process $\mathcal{L}(\theta)$. Here a slightly different standardization of
the process $\mathcal{L}(\theta)$ is used. Assume that the exponential moment of $\mathcal{L}(\theta,\theta_0)$ is finite for
all $\theta$. This enables us to define for each $\theta$ the rate function $\mathfrak{M}(\theta,\theta_0)$ which ensures
the identity

$$E \exp\{\mathcal{L}(\theta,\theta_0) + \mathfrak{M}(\theta,\theta_0)\} = 1.$$

This means that the process $\mathcal{L}(\theta,\theta_0) + \mathfrak{M}(\theta,\theta_0)$ is pointwise stochastically bounded in a
rather strict sense. We aim at establishing a similar bound for the maximum of
$\mathcal{L}(\theta,\theta_0) + \mathfrak{M}(\theta,\theta_0)$. It turns out that some payment for taking the maximum is necessary.
Namely, we present a penalty function $\operatorname{pen}(\theta)$ which ensures that the maximum of
$\mathcal{L}(\theta,\theta_0) + \mathfrak{M}(\theta,\theta_0) - \operatorname{pen}(\theta)$ is bounded with exponential moments. Then we show that
this fundamental fact yields a number of straightforward corollaries about the quality of
estimation.

The paper is organized as follows. The next section presents the main result which
describes an exponential upper bound for the (quasi) maximum likelihood. Section 2.2
discusses some implications of this exponential bound for statistical inference. In
particular, we present a general likelihood-based construction of confidence sets and
establish an exponential bound for the coverage probability. We also show that the
considered estimate concentrates well on the level set of the rate function $\mathfrak{M}(\theta,\theta_0)$.
Under some standard conditions we show that such concentration sets become usual root-n
neighborhoods of the target $\theta_0$. Sections 3 and 4 illustrate the obtained general results
for two quite popular statistical models: generalized linear and single-index. These models
are very well studied; the existing results claim asymptotic normality and efficiency of the
maximum likelihood estimate as the sample size grows to infinity. By contrast, our
study focuses on nonasymptotic deviation bounds and concentration properties of this
estimate. The main result giving an exponential bound for the maximum likelihood is
based on general results for the maximum of a random field described in Section 5.

2 Exponential bound for the maximum likelihood 

This section presents a general exponential bound on the (quasi) maximum likelihood
value in a quite general set-up. The main result concerns the value of the maximum
$\mathcal{L}(\tilde\theta) = \max_{\theta \in \Theta} \mathcal{L}(\theta)$ rather than the point of maximum $\tilde\theta$. Namely, we aim at
establishing some exponential bounds on the supremum in $\theta$ of the random field

$$\mathcal{L}(\theta,\theta_0) \stackrel{\mathrm{def}}{=} \mathcal{L}(\theta) - \mathcal{L}(\theta_0).$$

In this paper we do not specify the structure of the process $\mathcal{L}(\theta)$. The basic
assumption we impose on the considered model is that $\mathcal{L}(\theta)$ is absolutely continuous in
$\theta$ and that $\mathcal{L}(\theta)$ and its gradient w.r.t. $\theta$ have bounded exponential moments.

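$(E)$ The rate function $\mathfrak{M}(\theta,\theta_0)$ is finite for all $\theta \in \Theta$:

$$\mathfrak{M}(\theta,\theta_0) \stackrel{\mathrm{def}}{=} -\log E \exp\{\mathcal{L}(\theta,\theta_0)\}.$$

Note that this condition is automatically fulfilled if $P = P_{\theta_0}$ and $\mathcal{L}(\theta) = \mu \log(dP_\theta/dP_{\theta_0})$
with $\mu \le 1$ provided that all $P_\theta$ are absolutely continuous w.r.t. $P_{\theta_0}$. With $\mu = 1$ and
$\mathcal{L}(\theta) = \log(dP_\theta/dP_{\theta_0})$, it holds $\mathfrak{M}(\theta,\theta_0) = -\log E_{\theta_0}(dP_\theta/dP_{\theta_0}) = 0$. For $\mu < 1$,
$\mathfrak{M}(\theta,\theta_0) = -\log E_{\theta_0}(dP_\theta/dP_{\theta_0})^{\mu} \ge 0$ by the Jensen inequality.
The main observation behind the condition $(E)$ is that

$$E \exp\{\mathcal{L}(\theta,\theta_0) + \mathfrak{M}(\theta,\theta_0)\} = 1.$$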
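As a sanity check of the rate function, consider a single Gaussian observation $Y \sim N(0,1)$ with the scaled log-likelihood ratio $\mu(Y\theta - \theta^2/2)$. A direct computation gives $-\log E\exp\{\mu(Y\theta - \theta^2/2)\} = \mu(1-\mu)\theta^2/2$, which vanishes for $\mu = 1$ and is positive for $\mu < 1$. The sketch below verifies this by numerical integration; the Gaussian model, the integration range and the function names are assumptions made for illustration.

```python
import math


def expectation_std_normal(f, lo=-12.0, hi=12.0, steps=24000):
    """Approximate E f(Y) for Y ~ N(0, 1) with the composite trapezoidal rule."""
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps + 1):
        y = lo + k * h
        weight = 0.5 if k in (0, steps) else 1.0
        total += weight * f(y) * math.exp(-0.5 * y * y)
    return total * h / math.sqrt(2.0 * math.pi)


def rate_function(theta, mu):
    """Rate function -log E exp{mu * (Y * theta - theta**2 / 2)} for true parameter 0."""
    moment = expectation_std_normal(
        lambda y: math.exp(mu * (y * theta - 0.5 * theta * theta)))
    return -math.log(moment)


def closed_form(theta, mu):
    """Closed-form value mu * (1 - mu) * theta**2 / 2 for the Gaussian shift."""
    return mu * (1.0 - mu) * theta * theta / 2.0


for mu in (0.3, 0.5, 1.0):
    print(mu, rate_function(1.5, mu), closed_form(1.5, mu))
```

Multiplying the exponent by $\exp$ of the rate function restores the unit exponential moment, which is exactly the normalization used in the identity above.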

Our main goal is to get a similar bound for the maximum of the random field
$\mathcal{L}(\theta,\theta_0) + \mathfrak{M}(\theta,\theta_0)$ over $\theta \in \Theta$. Below in Section 2.2 we show that such a bound implies
an exponential bound for the coverage probability for a confidence set
$\mathcal{E}(\mathfrak{z}) = \{\theta : \mathcal{L}(\tilde\theta,\theta) \le \mathfrak{z}\}$ and that the estimate $\tilde\theta$ concentrates well on a set
$\mathcal{A}(\mathfrak{r},\theta_0) = \{\theta : \mathfrak{M}(\theta,\theta_0) \le \mathfrak{r}\}$ in the sense that the probability of the event
$\{\tilde\theta \notin \mathcal{A}(\mathfrak{r},\theta_0)\}$ is exponentially small in $\mathfrak{r}$. Unfortunately, in some situations, the
exponential moment of the maximum of $\mathcal{L}(\theta,\theta_0) + \mathfrak{M}(\theta,\theta_0)$ can be unbounded. We present a
simple example of this sort.

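Example 2.1. Consider a Gaussian shift with only one observation $Y \sim N(\theta, 1)$ and
suppose that the true parameter is $\theta_0 = 0$. Then the log-likelihood ratio $\mathcal{L}(\theta,\theta_0)$
reads as $\mathcal{L}(\theta,\theta_0) = Y\theta - \theta^2/2$, and it holds $\mathfrak{M}(\theta,\theta_0) = 0$, $\sup_\theta \mathcal{L}(\theta,\theta_0) = Y^2/2$ and
$E \exp\{\sup_\theta \mathcal{L}(\theta,\theta_0)\} = E_{\theta_0} \exp\{Y^2/2\} = \infty$.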
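The divergence in this example can be made quantitative. For a standard normal $Y$ and a factor $\rho < 1$ one has $E\exp\{\rho Y^2/2\} = (1-\rho)^{-1/2}$, which is finite but blows up as $\rho \to 1$. The following sketch checks this numerically; the factor $\rho$ and the function name are introduced here only for illustration.

```python
import math


def exp_moment_of_sup(rho, steps=20000):
    """Approximate E exp{rho * Y**2 / 2} for Y ~ N(0, 1) by the trapezoidal rule."""
    # The integrand exp{-(1 - rho) * y**2 / 2} has scale (1 - rho)**-0.5,
    # so the integration range is stretched accordingly.
    hi = 12.0 / math.sqrt(1.0 - rho)
    h = 2.0 * hi / steps
    total = 0.0
    for k in range(steps + 1):
        y = -hi + k * h
        weight = 0.5 if k in (0, steps) else 1.0
        total += weight * math.exp(-0.5 * (1.0 - rho) * y * y)
    return total * h / math.sqrt(2.0 * math.pi)


# The moment stays finite for rho < 1 but grows without bound as rho approaches 1.
moments = {rho: exp_moment_of_sup(rho) for rho in (0.5, 0.9, 0.99)}
for rho, value in moments.items():
    print(rho, value, (1.0 - rho) ** -0.5)
```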

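We therefore consider the penalized expression $\mathcal{L}(\theta,\theta_0) + \mathfrak{M}(\theta,\theta_0) - \operatorname{pen}(\theta)$, where
the penalty function $\operatorname{pen}(\theta)$ should provide some bounded exponential moments for

$$\sup_{\theta \in \Theta} \bigl[\mathcal{L}(\theta,\theta_0) + \mathfrak{M}(\theta,\theta_0) - \operatorname{pen}(\theta)\bigr].$$

To bound local fluctuations of the process $\mathcal{L}(\theta)$, we introduce an exponential moment
condition on the stochastic component $\zeta(\theta)$:

$$\zeta(\theta) \stackrel{\mathrm{def}}{=} \mathcal{L}(\theta) - E\mathcal{L}(\theta).$$

Suppose also that the random function $\zeta(\theta)$ is differentiable in $\theta$ and its gradient
$\nabla\zeta(\theta) = d\zeta(\theta)/d\theta \in \mathbb{R}^p$ fulfills the following condition:

$(ED)$ There exist some continuous symmetric matrix function $V(\theta)$ for $\theta \in \Theta$ and a
constant $\lambda^* > 0$ such that for all $|\lambda| \le \lambda^*$

$$\sup_{\theta \in \Theta}\ \sup_{\gamma \in S^p} \log E \exp\Bigl\{2\lambda \frac{\gamma^\top \nabla\zeta(\theta)}{\sqrt{\gamma^\top V(\theta)\gamma}}\Bigr\} \le 2\lambda^2. \tag{2.1}$$

Define for every $\theta, \theta' \in \Theta$, $d = \|\theta - \theta'\|$ and $\gamma = (\theta' - \theta)/d$

$$\mathfrak{S}^2(\theta,\theta') \stackrel{\mathrm{def}}{=} d^2 \int_0^1 \gamma^\top V(\theta + t d \gamma)\gamma \, dt.$$

Next, introduce for every $\theta^\circ \in \Theta$ the local vicinity $\mathfrak{B}(\varepsilon,\theta^\circ)$ such that
$\mathfrak{S}(\theta,\theta^\circ) \le \varepsilon$ for all $\theta \in \mathfrak{B}(\varepsilon,\theta^\circ)$.

Let also the function $V(\cdot)$ from $(ED)$ satisfy the following regularity condition:

$(V)$ There exist constants $\varepsilon > 0$ and $\nu_1 \ge 1$ such that

$$\sup_{\theta,\theta^\circ \in \Theta:\ \mathfrak{S}(\theta,\theta^\circ) \le \varepsilon}\ \sup_{\gamma \in S^p} \frac{\gamma^\top V(\theta)\gamma}{\gamma^\top V(\theta^\circ)\gamma} \le \nu_1.$$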
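Now we are prepared to state the main result which gives a sufficient condition
on the penalty function $\operatorname{pen}(\theta)$ ensuring the desired penalized exponential bound. It is
a specification of a more general result from Theorem 5.5 in Section 5.

Here and in what follows $\omega_p$ (resp. $Q_p$) denotes the volume (resp. the entropy
number) of the unit ball in $\mathbb{R}^p$.

Theorem 2.1. Suppose that the condition $(E)$ is fulfilled and $(ED)$ holds with some
$\lambda^*$ and a matrix function $V(\theta)$ which fulfills $(V)$ for $\varepsilon > 0$ and $\nu_1 \ge 1$. If for some
$\rho \in (0,1)$ with $\rho\varepsilon/(1-\rho) \le \lambda^*$, the penalty function $\operatorname{pen}(\theta)$ fulfills

$$\mathfrak{T}_\varepsilon(\rho) \stackrel{\mathrm{def}}{=} \log\Bigl\{\omega_p^{-1}\varepsilon^{-p} \int_\Theta \sqrt{\det V(\theta)}\, \exp\{-\rho \operatorname{pen}_\varepsilon(\theta)\}\, d\theta\Bigr\} < \infty \tag{2.2}$$

with $\operatorname{pen}_\varepsilon(\theta^\circ) = \inf_{\theta \in \mathfrak{B}(\varepsilon,\theta^\circ)} \operatorname{pen}(\theta)$, then

$$E \exp\Bigl\{\rho \sup_{\theta \in \Theta}\bigl[\mathcal{L}(\theta,\theta_0) + \mathfrak{M}(\theta,\theta_0) - \operatorname{pen}(\theta)\bigr]\Bigr\} \le \mathfrak{Q}(\rho) \tag{2.3}$$

where

$$\log \mathfrak{Q}(\rho) = \frac{2\rho^2\varepsilon^2}{1-\rho} + (1-\rho) Q_p + \mathfrak{T}_\varepsilon(\rho) + p \log(\nu_1). \tag{2.4}$$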

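2.1 Penalty via the norm $\|\sqrt{V^*}(\theta - \theta_0)\|$

The choice of the penalty function $\operatorname{pen}(\theta)$ can be made more precise if $V(\theta) \le V^*$ for
a fixed matrix $V^*$ and all $\theta$. This section describes how the penalty function can be
defined in terms of the norm $\|\sqrt{V^*}(\theta - \theta_0)\|$.

Theorem 2.2. Let the conditions $(E)$ and $(ED)$ be fulfilled and in addition $V(\theta) \le V^*$
for some matrix $V^*$ and all $\theta \in \Theta$. Let $\rho \in (0,1)$ and $\varepsilon > 0$ be fixed to ensure
$\rho\varepsilon/(1-\rho) \le \lambda^*$. Suppose that $\kappa(t)$ is a monotonically decreasing positive function on
$[0,+\infty)$ satisfying

$$\mathfrak{P} \stackrel{\mathrm{def}}{=} \omega_p^{-1} \int_{\mathbb{R}^p} \kappa(\|z\|)\, dz = p \int_0^\infty \kappa(t)\, t^{p-1}\, dt < \infty. \tag{2.5}$$

Define

$$\operatorname{pen}(\theta) = -\rho^{-1} \log \kappa\bigl(\varepsilon^{-1}\|\sqrt{V^*}(\theta - \theta_0)\| + 1\bigr). \tag{2.6}$$

Then the assertion (2.3) holds with

$$\log \mathfrak{Q}(\rho) = \frac{2\rho^2\varepsilon^2}{1-\rho} + (1-\rho) Q_p + \log(\mathfrak{P}).$$

Proof. This result is a straightforward corollary of Theorem 2.1 applied with $V(\theta) \equiv V^*$
and thus, condition $(V)$ is fulfilled with $\nu_1 = 1$. □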
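Here are two natural ways of defining the penalty function $\operatorname{pen}(\theta)$: quadratic or
logarithmic in $\|\sqrt{V^*}(\theta - \theta_0)\|$. The functions $\kappa(\cdot)$ and the corresponding $\mathfrak{P}$-values are:

where $\delta_1, \delta_2 > 0$ are some constants and $[a]_+$ means $\max\{a, 0\}$. The corresponding
penalties read as:

$$\operatorname{pen}_1(\theta) = \rho^{-1}\delta_1 \varepsilon^{-2}\|\sqrt{V^*}(\theta - \theta_0)\|^2,$$

$$\operatorname{pen}_2(\theta) = \rho^{-1}(p + \delta_2) \log\bigl(\varepsilon^{-1}\|\sqrt{V^*}(\theta - \theta_0)\| + 2\bigr).$$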

2.2 Some corollaries 

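Theorem 2.1 claims that the value $\mathcal{L}(\theta,\theta_0) + \mathfrak{M}(\theta,\theta_0) - \operatorname{pen}(\theta)$ is uniformly in
$\theta \in \Theta$ stochastically bounded. In particular, one can plug the estimate $\tilde\theta$ in place of $\theta$:

$$E \exp\bigl\{\rho\bigl[\mathcal{L}(\tilde\theta,\theta_0) + \mathfrak{M}(\tilde\theta,\theta_0) - \operatorname{pen}(\tilde\theta)\bigr]\bigr\} \le \mathfrak{Q}(\rho). \tag{2.8}$$

Below we present some corollaries of this result.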

2.2.1 Concentration properties of the estimator 

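Define for every subset $A$ of the parameter set $\Theta$ the value

$$\mathfrak{z}(A) \stackrel{\mathrm{def}}{=} \inf_{\theta \notin A} \{\mathfrak{M}(\theta,\theta_0) - \operatorname{pen}(\theta)\}. \tag{2.9}$$

The next result shows that the estimator $\tilde\theta$ deviates out of the set $A$ with an
exponentially small probability of order $\exp\{-\rho\,\mathfrak{z}(A)\}$.

Corollary 2.3. Suppose (2.8). Then for any set $A \subset \Theta$

$$P(\tilde\theta \notin A) \le \mathfrak{Q}(\rho)\, e^{-\rho\,\mathfrak{z}(A)}.$$

Proof. If $\tilde\theta \notin A$, then $\mathfrak{M}(\tilde\theta,\theta_0) - \operatorname{pen}(\tilde\theta) \ge \mathfrak{z}(A)$. As $\mathcal{L}(\tilde\theta,\theta_0) \ge 0$, it follows

$$\mathfrak{Q}(\rho) \ge E \exp\bigl\{\rho\bigl[\mathcal{L}(\tilde\theta,\theta_0) + \mathfrak{M}(\tilde\theta,\theta_0) - \operatorname{pen}(\tilde\theta)\bigr]\bigr\}
\ge E \exp\bigl\{\rho\bigl[\mathfrak{M}(\tilde\theta,\theta_0) - \operatorname{pen}(\tilde\theta)\bigr]\bigr\} \ge e^{\rho\,\mathfrak{z}(A)} P(\tilde\theta \notin A)$$

as required. □

Two particular choices of the set $A$ can be mentioned:

$$A = \mathcal{A}(\mathfrak{r},\theta_0) = \{\theta : \mathfrak{M}(\theta,\theta_0) \le \mathfrak{r}\},$$

$$A = \mathcal{A}'(\mathfrak{r},\theta_0) = \{\theta : \mathfrak{M}(\theta,\theta_0) - \operatorname{pen}(\theta) \le \mathfrak{r}\}.$$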
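For the set $\mathcal{A}'(\mathfrak{r},\theta_0)$, Corollary 2.3 yields

$$P\bigl(\tilde\theta \notin \mathcal{A}'(\mathfrak{r},\theta_0)\bigr) = P\bigl(\mathfrak{M}(\tilde\theta,\theta_0) - \operatorname{pen}(\tilde\theta) > \mathfrak{r}\bigr) \le \mathfrak{Q}(\rho)\, e^{-\rho\mathfrak{r}}.$$

For the set $\mathcal{A}(\mathfrak{r},\theta_0)$, define additionally $\mathfrak{b}(\mathfrak{r})$ as the minimal value for which

$$\mathfrak{M}(\theta,\theta_0) - \operatorname{pen}(\theta) \ge \mathfrak{r} - \mathfrak{b}(\mathfrak{r}), \qquad \theta \notin \mathcal{A}(\mathfrak{r},\theta_0),$$

or, equivalently,

$$\mathfrak{b}(\mathfrak{r}) = \sup_{\theta \notin \mathcal{A}(\mathfrak{r},\theta_0)} \{\mathfrak{r} + \operatorname{pen}(\theta) - \mathfrak{M}(\theta,\theta_0)\}. \tag{2.10}$$

Corollary 2.4. Suppose (2.8). Then for any $\mathfrak{r} > 0$

$$P\bigl(\tilde\theta \notin \mathcal{A}(\mathfrak{r},\theta_0)\bigr) = P\bigl(\mathfrak{M}(\tilde\theta,\theta_0) > \mathfrak{r}\bigr) \le \mathfrak{Q}(\rho)\, e^{-\rho[\mathfrak{r} - \mathfrak{b}(\mathfrak{r})]}.$$

In typical situations the value $\mathfrak{M}(\theta,\theta_0)$ is nearly proportional to the sample size $n$
and is nearly quadratic in $\theta - \theta_0$ so that for a fixed $\mathfrak{r}$ the set $\mathcal{A}(\mathfrak{r},\theta_0)$ corresponds to a
root-n neighborhood of the point $\theta_0$. See below in Section 2.4 for a precise formulation.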

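2.2.2 Confidence sets based on $\mathcal{L}(\tilde\theta,\theta)$

Next we discuss how the exponential bound can be used for establishing some risk bounds
and for constructing confidence sets for the target $\theta_0$ based on the maximized value
$\mathcal{L}(\tilde\theta,\theta)$. The inequality (2.8) claims that $\mathcal{L}(\tilde\theta,\theta_0)$ is stochastically bounded with finite
exponential moments. This implies boundedness of the polynomial moments.
Define

$$\mathfrak{b} = \mathfrak{b}(0) = \sup_{\theta} \bigl[\operatorname{pen}(\theta) - \mathfrak{M}(\theta,\theta_0)\bigr]_+ . \tag{2.11}$$

Corollary 2.5. Suppose (2.8) and let $\mathfrak{b}$ from (2.11) be finite. Then

$$E \exp\{\rho\,\mathcal{L}(\tilde\theta,\theta_0)\} \le e^{\rho\mathfrak{b}}\, \mathfrak{Q}(\rho).$$

Proof. Observe that

$$E \exp\{\rho\,\mathcal{L}(\tilde\theta,\theta_0)\} \le e^{\rho\mathfrak{b}} E \exp\bigl\{\rho\bigl[\mathcal{L}(\tilde\theta,\theta_0) + \mathfrak{M}(\tilde\theta,\theta_0) - \operatorname{pen}(\tilde\theta)\bigr]\bigr\} \le e^{\rho\mathfrak{b}}\, \mathfrak{Q}(\rho)$$

as required. □

For the same reasons, one can construct confidence sets based on the (quasi) likelihood
process. Define

$$\mathcal{E}(\mathfrak{z}) = \{\theta \in \Theta : \mathcal{L}(\tilde\theta,\theta) \le \mathfrak{z}\}.$$

The bound for $\mathcal{L}(\tilde\theta,\theta_0)$ ensures that the target $\theta_0$ belongs to this set with a high
probability provided that $\mathfrak{z}$ is large enough. The next result claims that $\mathcal{E}(\mathfrak{z})$ does not
cover the true value $\theta_0$ with a probability which decreases exponentially in $\mathfrak{z}$.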
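Corollary 2.6. Suppose (2.8). For any $\mathfrak{z} > 0$

$$P\bigl(\theta_0 \notin \mathcal{E}(\mathfrak{z})\bigr) \le \mathfrak{Q}(\rho) \exp\{-\rho\mathfrak{z} + \rho\mathfrak{b}\}.$$

Proof. The bound (2.8) implies for the event $\{\theta_0 \notin \mathcal{E}(\mathfrak{z})\} = \{\mathcal{L}(\tilde\theta,\theta_0) > \mathfrak{z}\}$

$$P\bigl(\theta_0 \notin \mathcal{E}(\mathfrak{z})\bigr) \le P\bigl(\rho\bigl[\mathcal{L}(\tilde\theta,\theta_0) + \mathfrak{M}(\tilde\theta,\theta_0) - \operatorname{pen}(\tilde\theta)\bigr] > \rho\mathfrak{z} - \rho\mathfrak{b}\bigr)$$

$$\le \exp\{-\rho\mathfrak{z} + \rho\mathfrak{b}\}\, E \exp\bigl\{\rho\bigl[\mathcal{L}(\tilde\theta,\theta_0) + \mathfrak{M}(\tilde\theta,\theta_0) - \operatorname{pen}(\tilde\theta)\bigr]\bigr\}$$

$$\le \mathfrak{Q}(\rho) \exp\{-\rho\mathfrak{z} + \rho\mathfrak{b}\}$$

as required. □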



2.3 Identifiability condition 

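Until this point no identifiability condition on the model has been used, that is,
the presented results apply even for a very poor parametrization. Actually, a particular
parametrization of the parameter set plays no role as long as the value of the maximum
is considered. If we want to derive any quantitative result on the point of maximum
$\tilde\theta$, then the parametrization matters and an identifiability condition is really
necessary. Here we follow the usual path by applying the quadratic lower bound for the
rate function $\mathfrak{M}(\theta,\theta_0)$ in a vicinity of the point $\theta_0$. Suppose that the rate function
$\mathfrak{M}(\theta,\theta_0) = -\log E \exp\{\mathcal{L}(\theta,\theta_0)\}$ is two times continuously differentiable in $\theta$.
Obviously $\mathfrak{M}(\theta_0,\theta_0) = 0$ and simple algebra yields for the gradient
$\nabla\mathfrak{M}(\theta,\theta_0) = d\mathfrak{M}(\theta,\theta_0)/d\theta$:

$$\nabla\mathfrak{M}(\theta,\theta_0)\big|_{\theta=\theta_0} = -E\nabla\mathcal{L}(\theta)\big|_{\theta=\theta_0} = -\nabla E\mathcal{L}(\theta_0) = 0$$

because $\theta_0$ is the point of maximum of $E\mathcal{L}(\theta)$. The Taylor expansion of the second
order in a vicinity of $\theta_0$ yields for all $\theta$ close to $\theta_0$ the following approximation:

$$\mathfrak{M}(\theta,\theta_0) \approx (\theta - \theta_0)^\top I(\theta_0)(\theta - \theta_0)/2$$

with the matrix $I(\theta_0) \stackrel{\mathrm{def}}{=} \nabla^2\mathfrak{M}(\theta,\theta_0)\big|_{\theta=\theta_0}$. So, one can expect that the rate function
$\mathfrak{M}(\theta,\theta_0)$ is nearly quadratic in $\theta - \theta_0$ in a neighborhood of the point $\theta_0$.

Corollary 2.7. Let (2.8) hold. Suppose that for some positive symmetric matrix $D$ and
some $\mathfrak{r} > 0$, the function $\mathfrak{M}(\theta,\theta_0)$ fulfills

$$\mathfrak{M}(\theta,\theta_0) \ge \|D(\theta - \theta_0)\|^2, \qquad \theta \in \mathcal{A}(\mathfrak{r},\theta_0). \tag{2.12}$$

Then for any $\mathfrak{z} \le \mathfrak{r}$

$$P\bigl(\|D(\tilde\theta - \theta_0)\|^2 > \mathfrak{z}\bigr) \le \mathfrak{Q}(\rho)\, e^{-\rho[\mathfrak{z} - \mathfrak{b}(\mathfrak{z})]}.$$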
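Proof. It is obvious that

$$\{\|D(\tilde\theta - \theta_0)\|^2 > \mathfrak{z}\} \subset \{\|D(\tilde\theta - \theta_0)\|^2 > \mathfrak{z},\ \tilde\theta \in \mathcal{A}(\mathfrak{r},\theta_0)\} \cup \{\tilde\theta \notin \mathcal{A}(\mathfrak{r},\theta_0)\}$$

$$\subset \{\mathfrak{M}(\tilde\theta,\theta_0) > \mathfrak{z},\ \tilde\theta \in \mathcal{A}(\mathfrak{r},\theta_0)\} \cup \{\mathfrak{M}(\tilde\theta,\theta_0) > \mathfrak{r}\}$$

$$= \{\mathfrak{M}(\tilde\theta,\theta_0) > \mathfrak{z}\}$$

and the result follows from Corollary 2.4. □

In the next theorem we assume the lower bound (2.12) to be fulfilled on the whole
parameter set $\Theta$. The general case can be reduced to this one by using once again the
concentration property of Corollary 2.4.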

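Theorem 2.8. Suppose $(E)$, $(ED)$ with $V(\theta) \le V^*$ for a matrix $V^*$. Let also for
some $a > 0$

$$\mathfrak{M}(\theta,\theta_0) \ge a^2 (\theta - \theta_0)^\top V^* (\theta - \theta_0), \qquad \theta \in \Theta. \tag{2.13}$$

Fix some $a_1 < a$ and define $\operatorname{pen}(\theta)$ by

$$\operatorname{pen}(\theta) = a_1^2 (\theta - \theta_0)^\top V^* (\theta - \theta_0). \tag{2.14}$$

Then with $s = 1 - a_1^2/a^2$ it holds

$$\log \mathfrak{Q}(\rho,s) \stackrel{\mathrm{def}}{=} \log E \exp\Bigl\{\rho \sup_{\theta \in \Theta}\bigl[\mathcal{L}(\theta,\theta_0) + \mathfrak{M}(\theta,\theta_0) - \operatorname{pen}(\theta)\bigr]\Bigr\}$$

$$\le 2\rho + (1-\rho) Q_p + \log\bigl[1 + \cdots\bigr]$$

$$\le p\,C(\rho) + p \log\bigl(|a^2(1-s)(1-\rho)|^{-1/2}\bigr) \tag{2.15}$$

for some fixed constant $C(\rho)$. In addition, $\mathfrak{b}(\mathfrak{r})$ from (2.10) fulfills $\mathfrak{b}(\mathfrak{r}) \le (1-s)\mathfrak{r}$ for all
$\mathfrak{r} > 0$, yielding for any $\mathfrak{z} > 0$ the concentration property and confidence bound:

$$P\bigl(\tilde\theta \notin \mathcal{A}(\mathfrak{z},\theta_0)\bigr) \le \mathfrak{Q}(\rho,s)\, e^{-\rho s \mathfrak{z}}, \qquad \mathcal{A}(\mathfrak{z},\theta_0) = \{\theta : \mathfrak{M}(\theta,\theta_0) \le \mathfrak{z}\},$$

$$P\bigl(\theta_0 \notin \mathcal{E}(\mathfrak{z})\bigr) \le \mathfrak{Q}(\rho,0)\, e^{-\rho\mathfrak{z}}, \qquad \mathcal{E}(\mathfrak{z}) = \{\theta : \mathcal{L}(\tilde\theta,\theta) \le \mathfrak{z}\}.$$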

Proof. We apply Theorem 2.2 with

$$\kappa(t) = \exp\{-(1-\rho)\, a_1^2\, [t-1]_+^2\}$$

leading for $\varepsilon^2 = (1-\rho)/\rho$ and $t = \varepsilon^{-1}\|\sqrt{V^*}(\theta - \theta_0)\|$ to the formula (2.14) for $\operatorname{pen}(\theta)$.
Simple algebra bounds the corresponding value $\mathfrak{P}$;
cf. the bound (2.7) for $\mathfrak{P}$ with $\delta_1 = (1-\rho) a_1^2$. This implies the bound (2.15) for
$\mathfrak{Q}(\rho)$ because $p^{-1} Q_p$ and $\log \omega_p^{-1}$ are bounded by some fixed constants.

The inequality (2.13) ensures for $\mathfrak{r} = \mathfrak{M}(\theta,\theta_0)$ that $\operatorname{pen}(\theta) \le (a_1^2/a^2)\,\mathfrak{r}$, i.e.
$\mathfrak{b}(\mathfrak{r}) \le (a_1^2/a^2)\,\mathfrak{r}$ and $\mathfrak{b} = \mathfrak{b}(0) = 0$. Finally, the concentration and coverage bounds
follow from Corollaries 2.4 and 2.6. □

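Remark 2.1. If the quadratic lower bound (2.13) is only fulfilled for $\theta$ from an elliptic
neighborhood $\mathfrak{B}'(\mathfrak{r},\theta_0) = \{\theta : \|\sqrt{V^*}(\theta - \theta_0)\| \le \mathfrak{r}\}$ of the point $\theta_0$ with a sufficiently
large $\mathfrak{r}$, then it is reasonable to redefine the penalty function using the hybrid proposal:

$$\operatorname{pen}(\theta) = \begin{cases} a_1^2 \|\sqrt{V^*}(\theta - \theta_0)\|^2, & \theta \in \mathfrak{B}'(\mathfrak{r},\theta_0), \\ \rho^{-1}(p+1) \log\bigl(\|\sqrt{V^*}(\theta - \theta_0)\| + 2\bigr), & \theta \notin \mathfrak{B}'(\mathfrak{r},\theta_0). \end{cases}$$

Then the bound (2.15) still applies with the obvious correction of the value $\mathfrak{Q}(\rho,s)$.
However, the values $\mathfrak{b}$ and $\mathfrak{b}(\mathfrak{r})$ from (2.10) entering in our risk bounds have to be
corrected depending on the behavior of the rate function $\mathfrak{M}(\theta,\theta_0)$ for $\theta \notin \mathfrak{B}'(\mathfrak{r},\theta_0)$.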

2.4 Discussion 

This section collects some comments about the presented exponential bound. 
Bounds for polynomial loss 

Our concentration result is stated in terms of the rate function $\mathfrak{M}(\theta,\theta_0)$. Note that the
bounds (2.15) and (2.13) imply the usual result about the quadratic loss $\|\sqrt{V^*}(\tilde\theta - \theta_0)\|^2$:

$$P\bigl(\|a\sqrt{V^*}(\tilde\theta - \theta_0)\|^2 > \mathfrak{z}\bigr) \le \mathfrak{Q}(\rho,s)\, e^{-\rho s \mathfrak{z}}.$$

Note however, that the result (2.15) in terms of the rate function $\mathfrak{M}(\theta,\theta_0)$ is more
accurate because the lower bound (2.13) can be very rough. The bound (2.13) as well
as the bound $V(\theta) \le V^*$ are only used to evaluate the constants in the exponential
risk bound. Moreover, if $\rho$ or $s$ approaches one, the leading term in the risk bound is
$p \log(|(1-s)(1-\rho)|^{-1/2})$ which does not depend on $a$ or $V^*$.

Coverage probability and risk bounds 

The result of Corollary 2.5 justifies the use of the confidence set $\mathcal{E}(\mathfrak{z}) = \{\theta : \mathcal{L}(\tilde\theta,\theta) \le \mathfrak{z}\}$.
However, the bound for the coverage probability given by this result is quite rough and
cannot be used for practical purposes. One has to apply one or another resampling
scheme to fix a proper value $\mathfrak{z}(\alpha)$ providing the prescribed coverage probability $1 - \alpha$.

The same remark applies to the result of Corollary 2.7. All these bounds are deduced
from rather rough exponential inequalities and the constants shown there are not optimal.
However, the concentration property enables us to apply the classical one-step
improvement technique to build a new estimate which achieves the asymptotic efficiency bound.

Root-n consistency 

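Suppose that there exists a constant $n$ (usually this constant means the sample size)
such that the functions

$$v(\theta) \stackrel{\mathrm{def}}{=} n^{-1} V(\theta), \qquad m(\theta,\theta_0) \stackrel{\mathrm{def}}{=} n^{-1}\mathfrak{M}(\theta,\theta_0)$$

are continuous and bounded on every compact set by constants which only depend on
this set. In addition we assume similarly to (2.12) that for some fixed symmetric positive
matrix $D_1$ and some $\mathfrak{r} > 0$, it holds in the vicinity $\mathcal{A}(\mathfrak{r},\theta_0)$ of the point $\theta_0$:

$$m(\theta,\theta_0) \ge (\theta - \theta_0)^\top D_1^2 (\theta - \theta_0), \qquad v(\theta) \le a^{-2} D_1^2. \tag{2.16}$$

Then $\mathfrak{M}(\theta,\theta_0) \ge n(\theta - \theta_0)^\top D_1^2 (\theta - \theta_0)$ and the elliptic set
$\mathcal{A}^*(\mathfrak{r},\theta_0) \stackrel{\mathrm{def}}{=} \{\theta : (\theta - \theta_0)^\top D_1^2 (\theta - \theta_0) \le \mathfrak{r}/n\}$ is a root-n neighborhood of the point $\theta_0$.
By Theorem 2.8 the estimate $\tilde\theta$ deviates from this neighborhood with a probability which
decreases exponentially with $\mathfrak{r}$:

$$P\bigl(\|D_1(\tilde\theta - \theta_0)\|^2 > \mathfrak{r}/n\bigr) \le \mathfrak{Q}(\rho,s)\, e^{-\rho s \mathfrak{r}}.$$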
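For orientation, the simplest special case can be computed exactly. If $Y_1,\ldots,Y_n$ are i.i.d. $N(\theta_0, 1)$ and the estimate is the sample mean, then $n(\tilde\theta - \theta_0)^2$ has a chi-squared distribution with one degree of freedom, so the probability of leaving the root-n neighborhood of radius $\sqrt{\mathfrak{r}/n}$ does not depend on $n$ and decays exponentially in $\mathfrak{r}$. The sketch below tabulates this exact tail next to the elementary envelope $e^{-\mathfrak{r}/2}$. It is an illustrative assumption-laden example and does not reproduce the constants $\mathfrak{Q}(\rho,s)$ and $\rho s$ of the general bound.

```python
import math


def tail_probability(r):
    """P(n * (mean - theta0)**2 > r) for i.i.d. N(theta0, 1) data; it is free of n."""
    # sqrt(n) * (mean - theta0) is standard normal, so the event is {|Z| > sqrt(r)}.
    return math.erfc(math.sqrt(r / 2.0))


def exponential_envelope(r):
    """Elementary envelope exp(-r / 2), using erfc(x) <= exp(-x**2) for x >= 0."""
    return math.exp(-r / 2.0)


table = [(r, tail_probability(r), exponential_envelope(r)) for r in (1.0, 4.0, 9.0, 16.0)]
for r, tail, envelope in table:
    print(f"r = {r:5.1f}   exact tail = {tail:.3e}   envelope = {envelope:.3e}")
```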
Local approximation 

The standard asymptotic theory of parameter estimation heavily uses the idea of local ap- 
proximation: the considered (quasi) log likelihood is approximated by the log-likelihood 
of another simpler model in the vicinity of the true point yielding the local asymptotic 
equivalence of the original and the approximating model. The local asymptotic normal- 
ity (LAN) condition is the most popular example of this approach; see Ibragimov and 
Khas'minskij (1981), Ch. 2, for more details. A combination of this idea with the
concentration property of Corollary 2.4 can be used to derive sharp asymptotic risk bounds
for the estimate $\tilde\theta$; see again Ibragimov and Khas'minskij (1981), Ch. 3. Similarly
one can derive nonasymptotic risk bounds in the framework of this paper. However, a precise
formulation of the related results is to be given elsewhere.

Large and moderate deviation 

The obtained results can be used to derive large and moderate deviations for the estimate
$\tilde\theta$; cf. Jensen and Wood (1998), Sieders and Dzhaparidze (1987). In particular, the
deviation result from Corollary 2.4 can be used to study the efficiency of the estimate $\tilde\theta$
in the Bahadur sense; see e.g. Arcones (2006) and references therein.
3 Estimation in a generalized linear model

In this section we illustrate the general results of Sections 2 and 2.2 on the problem of
estimating the parameter vector in the so-called generalized linear model. Let $\mathcal{P}$ be
an exponential family with the canonical parametrization (EFC) which means that the
corresponding log-likelihood function can be written in the form

$$\ell(y,\upsilon) = y\upsilon - d(\upsilon) + \ell(y)$$

where $d(\cdot)$ is a given convex function; see Green and Silverman (1994). The term $\ell(y)$
is unimportant and it cancels in the log-likelihood ratio.

Let $Y = (Y_1, \ldots, Y_n)$ be an observed sample. A generalized linear assumption means
that the $Y_i$'s are independent, the distribution of every $Y_i$ belongs to $\mathcal{P}$ and the
corresponding parameter linearly depends on given feature vectors $\Psi_i$:

$$Y_i \sim P_{\Psi_i^\top\theta}. \tag{3.1}$$

To be more specific we consider deterministic explanatory variables $\Psi_1, \ldots, \Psi_n$. The
case of a random design can be considered in the same way.

The parametric assumption (3.1) leads to the log-likelihood $L(\theta) = \sum_i \ell(Y_i, \Psi_i^\top\theta)$:

$$L(\theta) = \sum_i \ell(Y_i, \Psi_i^\top\theta) = \sum_i \bigl\{Y_i \Psi_i^\top\theta - d(\Psi_i^\top\theta) + \ell(Y_i)\bigr\}. \tag{3.2}$$

Asymptotic properties of the MLE $\tilde\theta = \operatorname{argmax}_\theta L(\theta)$ are well studied. We refer to
Fahrmeir and Kaufmann (1985), Lang (1996), Chen et al. (1999) and the book McCullagh
and Nelder (1989) for further references. The results claim asymptotic consistency,
normality and efficiency of the estimate $\tilde\theta$.

Our approach is a bit different because we do not assume that the underlying model
follows (3.1). The observations $Y_i$ are independent, otherwise any particular structure is
allowed. In particular, the distribution of every $Y_i$ does not necessarily belong to $\mathcal{P}$.
The considered problem is the problem of the best parametric approximation of the data
distribution $P$ by the GLMs of the form $Y_i \sim P_{\Psi_i^\top\theta}$.

Example 3.1. [Mean regression] The least squares estimate $\tilde\theta$ in the classical mean
regression minimizes the sum of squared residuals:

$$\tilde\theta = \operatorname*{argmin}_{\theta} \sum_i (Y_i - \Psi_i^\top\theta)^2.$$

This estimate can be viewed as the quasi MLE corresponding to homogeneous Gaussian errors.
However, many of its properties continue to hold even if the errors are not i.i.d. Gaussian.
All we need is the existence of exponential moments of the errors.
Example 3.2. [Poisson regression] Let $Y_i$ be some nonnegative integers, observed at
"locations" $X_i$. Such data often appear in digital imaging, positron emission
tomography, queueing and traffic theory, and many other fields. A natural way of modeling
such data is to assume that every $Y_i$ is Poissonian with a parameter which depends on the
location $X_i$ through the regression vector $\Psi_i$. A generalized linear model assumes that
the canonical parameter of the underlying Poisson distribution of $Y_i$ linearly depends on
the vector $\Psi_i$ leading to the (quasi) MLE

$$\tilde\theta = \operatorname*{argmax}_{\theta} \sum_{i=1}^n \bigl\{Y_i \Psi_i^\top\theta - \exp(\Psi_i^\top\theta)\bigr\}.$$

For our further analysis we only require that every $Y_i$ has a bounded exponential moment,
see below the condition (3.5) for a precise formulation.
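To make the Poisson quasi MLE concrete, the sketch below maximizes the criterion $\sum_i \{Y_i \Psi_i\theta - \exp(\Psi_i\theta)\}$ for a scalar parameter by Newton's method. The data, the one-dimensional regressors and the function name are hypothetical and serve only as an illustration; the counts are not assumed to be truly Poisson.

```python
import math


def poisson_quasi_mle(ys, psis, theta=0.0, iters=50, tol=1e-12):
    """Maximize sum_i {y_i * psi_i * theta - exp(psi_i * theta)} over scalar theta."""
    for _ in range(iters):
        # Gradient and (negative) second derivative of the concave criterion.
        grad = sum(psi * (y - math.exp(psi * theta)) for y, psi in zip(ys, psis))
        hess = -sum(psi * psi * math.exp(psi * theta) for psi in psis)
        step = grad / hess
        theta -= step  # Newton update for a concave objective
        if abs(step) < tol:
            break
    return theta


# Hypothetical counts and one-dimensional regressors, for illustration only.
psis = [0.0, 0.5, 1.0, 1.5, 2.0]
ys = [1, 2, 2, 5, 7]

theta_hat = poisson_quasi_mle(ys, psis)
# The score (gradient of the criterion) vanishes at the maximizer.
score = sum(psi * (y - math.exp(psi * theta_hat)) for y, psi in zip(ys, psis))
print("quasi MLE:", theta_hat, "score at the estimate:", score)
```

Newton's method is used here because the criterion is concave in the parameter; any other concave maximization routine would serve equally well.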

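In the general situation, for some $\mu > 0$ which will be fixed later, define

$$\mathcal{L}(\theta) = \mu \sum_i \bigl\{Y_i \Psi_i^\top\theta - d(\Psi_i^\top\theta)\bigr\}.$$

The target $\theta_0$ maximizes $E\mathcal{L}(\theta)$:

$$\theta_0 = \operatorname*{argmax}_{\theta} E\mathcal{L}(\theta) = \operatorname*{argmax}_{\theta} \sum_i \bigl\{b_i \Psi_i^\top\theta - d(\Psi_i^\top\theta)\bigr\}, \tag{3.3}$$

where $b_i = EY_i$. This yields

$$\nabla E\mathcal{L}(\theta)\big|_{\theta=\theta_0} = \mu \sum_i \bigl\{b_i - d'(\Psi_i^\top\theta_0)\bigr\}\Psi_i = 0. \tag{3.4}$$

Next, for every $\theta$

$$\zeta(\theta) \stackrel{\mathrm{def}}{=} \mathcal{L}(\theta) - E\mathcal{L}(\theta) = \mu \sum_i (Y_i - b_i)\Psi_i^\top\theta,$$

$$\nabla\zeta(\theta) = \nabla\zeta = \mu \sum_i (Y_i - b_i)\Psi_i.$$

Let there exist a positive value $\lambda^*$, and for every $i \le n$ a value $n_i > 0$ such that

$$\log E \exp\Bigl\{2\lambda \frac{Y_i - b_i}{n_i}\Bigr\} \le 2\lambda^2, \qquad |\lambda| \le \lambda^*. \tag{3.5}$$

In the case of Gaussian errors, one can take $n_i = s_i \stackrel{\mathrm{def}}{=} E^{1/2}(Y_i - b_i)^2$. Define

$$V_1 \stackrel{\mathrm{def}}{=} \sum_i n_i^2 \Psi_i \Psi_i^\top, \qquad V \stackrel{\mathrm{def}}{=} \mu^2 V_1. \tag{3.6}$$

The matrix $V$ is symmetric and non-negative. Denote $c_i(\gamma) \stackrel{\mathrm{def}}{=} \gamma^\top\Psi_i\, n_i\, (\gamma^\top V_1 \gamma)^{-1/2}$
for any $\gamma \in S^p$. By definition, $|c_i(\gamma)| \le 1$; however, usually $|c_i(\gamma)|$ is much smaller, of
order $1/\sqrt{n}$.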
Lemma 3.1. Suppose (3.5). If for some $\mu > 0$, it holds $\mu\,|\Psi_i^\top(\theta - \theta_0)|\, n_i \le \cdots$ for
all $i$ and all $\theta \in \Theta$, then $(E)$ is satisfied. Moreover, if $\lambda$ is such that $|\lambda c_i(\gamma)| \le \lambda^*$
for all $i$ and any $\gamma \in S^p$, then $(ED)$ holds with this $\lambda$ and $V(\cdot) \equiv V$ from (3.6).

Proof. Let $|\lambda c_i(\gamma)| \le \lambda^*$ for all $i$. Independence of the $Y_i$'s and (3.5) imply in view of
$\sum_i c_i^2(\gamma) = 1$ that

$$\log E \exp\Bigl\{2\lambda \frac{\gamma^\top\nabla\zeta}{\sqrt{\gamma^\top V \gamma}}\Bigr\} - 2\lambda^2 = \sum_i \log E \exp\Bigl\{2\lambda c_i(\gamma) \frac{Y_i - b_i}{n_i}\Bigr\} - 2\lambda^2 \le 0.$$

This implies $(ED)$ with $V(\theta) \equiv V$. □

An important feature of the GLM is that the gradient $\nabla\zeta(\theta)$ and hence, the
corresponding matrix $V(\theta)$ do not depend on $\theta$. This automatically yields the condition
$(V)$ with $\nu_1 = 1$ and any $\varepsilon > 0$.

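Now we consider the rate function $\mathfrak{M}(\theta,\theta_0) \stackrel{\mathrm{def}}{=} -\log E \exp\{\mathcal{L}(\theta,\theta_0)\}$. It holds

$$\mathfrak{M}(\theta,\theta_0) = \mu \sum_i \bigl\{d(\Psi_i^\top\theta) - d(\Psi_i^\top\theta_0) - \Psi_i^\top(\theta - \theta_0)\, b_i\bigr\} - \sum_i \log E \exp\bigl\{\mu \Psi_i^\top(\theta - \theta_0)(Y_i - b_i)\bigr\}. \tag{3.7}$$

This function is smooth in $\theta$ and by (3.4)

$$\nabla\mathfrak{M}(\theta,\theta_0)\big|_{\theta=\theta_0} = \mu \sum_i \bigl\{d'(\Psi_i^\top\theta_0) - b_i\bigr\}\Psi_i = 0.$$

Moreover,

$$\nabla^2\mathfrak{M}(\theta,\theta_0) = \mu \sum_i d''(\Psi_i^\top\theta)\Psi_i\Psi_i^\top - \mu^2 \sum_i s_i^2(\theta)\Psi_i\Psi_i^\top$$

where with $\varepsilon_i = Y_i - b_i$ and $u = \theta - \theta_0$

$$s_i^2(\theta) \stackrel{\mathrm{def}}{=} \bigl(E e^{\mu\Psi_i^\top u\,\varepsilon_i}\bigr)^{-2} \Bigl\{E \varepsilon_i^2 e^{\mu\Psi_i^\top u\,\varepsilon_i}\, E e^{\mu\Psi_i^\top u\,\varepsilon_i} - \bigl(E \varepsilon_i e^{\mu\Psi_i^\top u\,\varepsilon_i}\bigr)^2\Bigr\}.$$

In particular, $s_i^2(\theta_0) = s_i^2 = \operatorname{Var} Y_i$. If $u = \theta - \theta_0$ is small then $s_i(\theta)$ is close to
$s_i$. The "identifiability" condition which would provide the concentration property from
Theorem 2.7 means that for some fixed positive constants $\mu$ and $a$

$$\mathfrak{M}(\theta,\theta_0) \ge a^2 (\theta - \theta_0)^\top V (\theta - \theta_0) \tag{3.8}$$

at least for all $\theta$ from a vicinity of $\theta_0$. The next lemma presents some simple sufficient
conditions.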
Lemma 3.2. Let for a subset $\Theta_0 \subseteq \Theta$ hold:

i i 

for some positive constants $a_1 = a_1(\Theta_0)$, $a = a(\Theta_0)$. Then (3.8) is fulfilled with any $\mu$
satisfying $\mu \le a_1/a^2$.

Proof. With $\mu \le a_1/a^2$, for any $\theta \in \Theta_0$

-V^Tl{e,eo) > 2fiaiVi - aVVi = (2^-^01 - a^)^ > aV. 

Now the result follows by the second order Taylor expansion of $\mathfrak{M}(\theta,\theta_0)$ at $\theta = \theta_0$. □

Now we are ready to state the main result for the GLM estimation problem which is 
a specification of Theorems 2.2 and 2.8. 

Theorem 3.3. Suppose that the Yi 's are independent and the point 6q is defined by 
(3.3). Let there exist > and the values rij such that (3.5) is fulfilled. Let, addition- 
ally, for some fj, > and all 0^0 

Hni\^J {6 - eo)\ < i<n, 

and with the matrices V,V\ from (3.6) and some A* > .■ 

Fix any g < 1 and e > with Qe/{1 — g) < X* . Define and pen{6) by (2.5) and 
(2.6). Then 

log£;exp|sup£.[£(0,0o) +9?t(0,0o) - pen(0)] | 
< 2e2^V(l + 9)% + log(r )• 
Let also there exist $a > 0$ such that the function $\mathfrak{M}(\theta,\theta_0)$ from (3.7) fulfills

$$\mathfrak{M}(\theta,\theta_0) \ge a^2 (\theta - \theta_0)^\top V (\theta - \theta_0).$$

Then for s = 1 - af /a^ 

P{Tl{e,eo)>i) < £l{g,s)e-^'^ 

logQ{g,s) < pC{g)+p\og{\a\l-s){l-g)r'/') 

and for the confidence set £(3) = {6 : li{0,0) < 3} holds with 0.{g) = £l{g,0) 

P{eo^e{i))<£l{g)e~^K 
4 Single-index regression

In this section we illustrate the general results of Sections 2 and 2.2 by the problem of
estimating the index vector in the so-called single-index regression model. Such models
are frequently used in statistical modeling to overcome the "curse of dimensionality"
problem, see Stone (1986).

Let $Y = (Y_1, \ldots, Y_n)$ be an observed sample. We assume that the $Y_i$'s are independent
and the distribution of every $Y_i$ belongs to an exponential family $\mathcal{P}$ with canonical
parametrization:

$$Y_i \sim P_{f_i} \in \mathcal{P}, \qquad (4.1)$$

where the underlying parameter $f_i$ can be different for each $i$. Regression analysis
aims at explaining this parameter $f_i$ as a function of the explanatory vector $X_i \in \mathbb{R}^d$:
$f_i = f(X_i)$ for some regression function $f(\cdot)$. We again consider a deterministic design
$X_1, \ldots, X_n$. The assumption $f_i = f(X_i)$ reduces the original problem to recovering the
regression function $f(\cdot)$ from the observed data. However, in the case of a large $d$ this
problem is too complex because of the design sparsity. This "curse of dimensionality"
problem can be avoided by some dimensionality reduction assumption. Below we consider
one possible assumption of this sort:

$$f(X_i) = g(X_i^\top\theta), \qquad (4.2)$$

where $g(\cdot)$ is a univariate link function, while $\theta \in \mathbb{R}^d$ is an index vector. This assumption
effectively means that the explanatory vector $X_i$ can be projected on the index $\theta$
and this projection can be used instead of the original vector without any information
loss. Therefore, the primary goal in estimation of a single-index model is in recovering
the index vector $\theta$. There are a number of results in the literature about the quality of
estimation of the index vector $\theta$. We mention Li (1991), Ichimura (1993), Hardle et al.
(1993), Hristache et al. (2001), Xia et al. (2002), Climov et al. (2002), Delecroix et al.
(2003), Yin et al. (2008) among many others.

Below we assume that the link function $g$ is given and sufficiently smooth.
Note, however, that the underlying model just follows (4.1). The considered problem
is the problem of the best parametric approximation of the function $f(x)$ by a single-index
function $g(x^\top\theta)$ with a fixed link function $g(\cdot)$. Such a problem often occurs as an
important building block in popular statistical procedures like logit regression, projection
pursuit, or neural networks. Note that for a linear link function $g(\cdot)$ we come back to
generalized linear estimation.

The parametric assumption $f(X_i) = g(X_i^\top\theta)$ leads to the log-likelihood $L(\theta) = \sum_i \ell(Y_i, g(X_i^\top\theta))$, where $\ell(y,v) = yv - d(v) + \ell(y)$ is the log-likelihood function for $\mathcal{P}$:

$$L(\theta) = \sum_i \ell(Y_i, g(X_i^\top\theta)) = \sum_i \bigl\{ Y_i g(X_i^\top\theta) - d(g(X_i^\top\theta)) + \ell(Y_i) \bigr\}. \qquad (4.3)$$

For some $\mu > 0$ whose value will be specified later, define

$$\mathcal{L}(\theta) \stackrel{\mathrm{def}}{=} \mu \sum_i \bigl\{ Y_i g(X_i^\top\theta) - d(g(X_i^\top\theta)) \bigr\}.$$

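We use the well-known property of the canonical exponential families, $E_v Y = d'(v)$,
which implies

$$E\mathcal{L}(\theta) = \mu \sum_i \bigl\{ d'(f_i) g(X_i^\top\theta) - d(g(X_i^\top\theta)) \bigr\},$$

$$\nabla E\mathcal{L}(\theta) = \mu \sum_i \bigl\{ d'(f_i) - d'(g(X_i^\top\theta)) \bigr\} g'(X_i^\top\theta) X_i.$$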

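The target $\theta_0$ maximizes $E\mathcal{L}(\theta)$:

$$\theta_0 = \operatorname*{argmax}_{\theta \in \Theta} E\mathcal{L}(\theta) = \operatorname*{argmax}_{\theta \in \Theta} \sum_i \bigl\{ d'(f_i) g(X_i^\top\theta) - d(g(X_i^\top\theta)) \bigr\}. \qquad (4.4)$$

This particularly yields

$$\nabla E\mathcal{L}(\theta_0) = 0.$$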

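Next, for every $\theta$

$$\zeta(\theta) \stackrel{\mathrm{def}}{=} \mathcal{L}(\theta) - E\mathcal{L}(\theta) = \mu \sum_i \bigl\{ Y_i - d'(f_i) \bigr\} g(X_i^\top\theta),$$

$$\nabla\zeta(\theta) = \mu \sum_i \bigl\{ Y_i - d'(f_i) \bigr\} g'(X_i^\top\theta) X_i.$$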

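It is easy to see that condition $(E)$ is fulfilled if $f_i + \mu\{ g(X_i^\top\theta) - g(X_i^\top\theta_0) \} \in U$ for all
$i$ and all $\theta \in \Theta$. Let $n(v)$ be a function of $v$ which ensures for some fixed $\lambda_1^* > 0$ that

$$\log E_v \exp\Bigl\{ 2\lambda \frac{Y - d'(v)}{n(v)} \Bigr\} \le 2\lambda^2, \qquad |\lambda| \le \lambda_1^*. \qquad (4.5)$$

Define

$$V_1(\theta) = \sum_i n^2(f_i) \, |g'(X_i^\top\theta)|^2 X_i X_i^\top, \qquad V(\theta) \stackrel{\mathrm{def}}{=} \mu^2 V_1(\theta). \qquad (4.6)$$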

Then for any 7 G S"^ 

'^1 (7Ti/(e)7)V2 / - 

provided that n{fi)\\'^'^ Xi\ < AJ for all i, which implies (ED) . 
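Now we consider the rate function $\mathfrak{M}(\theta,\theta_0)$. As $\log E_v \exp\{\mu Y\} = d(v + \mu) - d(v)$,
it holds

$$\mathfrak{M}(\theta,\theta_0) \stackrel{\mathrm{def}}{=} -\log E \exp\{\mathcal{L}(\theta,\theta_0)\}$$

with $\delta_i(\theta) \stackrel{\mathrm{def}}{=} g(X_i^\top\theta) - g(X_i^\top\theta_0)$. For the gradient $\nabla\mathfrak{M}(\theta,\theta_0)$ holds: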

i 

The equation $\nabla E\mathcal{L}(\theta_0) = 0$ implies $\nabla\mathfrak{M}(\theta,\theta_0)\big|_{\theta=\theta_0} = 0$. Moreover,



0=00 



Y,^^{d{9{xJe))\g{xJe)\^ + ^l[d{g{xJe)) - - ^l5i{e))\g{xJ e)]x,xj 
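The rate function and its first two derivatives can be re-derived directly from the definitions. The following is a sketch, not a transcription of the displays above: it assumes independence of the $Y_i$, that $f_i + \mu\delta_i(\theta) \in U$, and it uses the shorthand $g_i = g(X_i^\top\theta)$, $g_i' = g'(X_i^\top\theta)$, $g_i'' = g''(X_i^\top\theta)$, $g_{i,0} = g(X_i^\top\theta_0)$, which is introduced here only for brevity.

```latex
% Step 1: apply \log E_v \exp\{\mu Y\} = d(v+\mu) - d(v) with v = f_i and \mu replaced by \mu\delta_i(\theta).
\mathfrak{M}(\theta,\theta_0)
  = \sum_i \Bigl\{ \mu \bigl[ d(g_i) - d(g_{i,0}) \bigr]
      - d\bigl(f_i + \mu\delta_i(\theta)\bigr) + d(f_i) \Bigr\}.

% Step 2: differentiate, using \nabla\delta_i(\theta) = g_i' X_i.
\nabla\mathfrak{M}(\theta,\theta_0)
  = \mu \sum_i \Bigl\{ d'(g_i) - d'\bigl(f_i + \mu\delta_i(\theta)\bigr) \Bigr\} g_i' X_i .

% Step 3: differentiate once more.
\nabla^2\mathfrak{M}(\theta,\theta_0)
  = \mu \sum_i \Bigl\{ \bigl[ d''(g_i) - \mu\, d''\bigl(f_i + \mu\delta_i(\theta)\bigr) \bigr] |g_i'|^2
      + \bigl[ d'(g_i) - d'\bigl(f_i + \mu\delta_i(\theta)\bigr) \bigr] g_i'' \Bigr\} X_i X_i^\top .
```

At $\theta = \theta_0$ one has $\delta_i = 0$, so the gradient in Step 2 equals $-\nabla E\mathcal{L}(\theta_0) = 0$, in agreement with the text. In Step 3 the term carrying the extra factor $\mu$ is of smaller order for small $\mu$, which is why a proper choice of $\mu$ matters in the identifiability condition.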



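The "identifiability" condition which would provide the concentration property of $\tilde\theta$ in
an elliptic neighborhood of the point $\theta_0$ means that

$$n^{-1} \sum_i \Bigl\{ d''(g(X_i^\top\theta)) \, |g'(X_i^\top\theta)|^2 + \bigl[ d'(g(X_i^\top\theta)) - d'(f_i) \bigr] g''(X_i^\top\theta) \Bigr\} X_i X_i^\top \ge D_1$$

for a positive matrix $D_1$. This condition ensures that with a proper choice of $\mu$, the
value $\mathfrak{M}(\theta,\theta_0)$ satisfies $\mathfrak{M}(\theta,\theta_0) \ge C (\theta - \theta_0)^\top D_1 (\theta - \theta_0)$ for some $C = C(D_1) > 0$.

In the case of Gaussian regression $Y_i \sim N(f_i, \sigma^2)$, it holds $d(v) = v^2/(2\sigma^2)$, so that
$d'(v) = v/\sigma^2$ and $d''(v) = \sigma^{-2}$. The identifiability condition now reads as

$$(n\sigma^2)^{-1} \sum_i \Bigl\{ |g'(X_i^\top\theta)|^2 + \bigl[ g(X_i^\top\theta) - f_i \bigr] g''(X_i^\top\theta) \Bigr\} X_i X_i^\top \ge D_1.$$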

Theorem 4.1. Suppose that ~ Pf. £ 7 for some EFC 7 . Let the point Oq be defined 
by (4-4) o.'^^d 9 = sigxascKQ !i{9) he its estimate. Let also there exist > and the 
function n{v) such that (4-5) is fulfilled. Let also for some fi* > 

fi + fi*{g{Xj9) - g{Xj9o)} G U, i<n, 9€0, 

and for some A* > and the matrix Vi{9) from (4-6) 

n{f^)h'^Xi\ ^ 
Then for any $\mu$ with $0 < \mu \le \mu^*$, the conditions $(E)$ and $(ED)$ are fulfilled with $V(\theta)$
from (4.6). For any $\varrho < 1$ and $\epsilon > 0$ with $\varrho\epsilon/(1 - \varrho) \le \lambda^*$, it holds

log £; expjsup 0o) + 9J^(^,^o) - pen(0)] | 

< 2e\^/{l -q) + {1- 0)% + log(r ), 

where *P* and pen(0) are defined by (2.5) and (2.6). 

Let further there exist a > , and a matrix V* such that 

m{e,eo)> a'^ {9-60)^0^ (e-Oo), vie)<v*, e^e. 

Then for s = 1 — a\/a^ , it holds 

log£l{g,s) < pC{g)+plog{\a\l-s){l-g)\-^/^) 
and for the confidence set £(3) = {6 : L{6,6) < 3} holds with Q.{q) = £2(^>, 0) 

5 A penalized exponential bound for a random field 

Let $(\mathcal{Y}(v), v \in T)$ be a random field on a probability space $(\Omega, \mathcal{F}, P)$, where $T$ is a
separable locally compact space. For any $v \in T$ we assume the following exponential
moment condition to be fulfilled:

$(\mathcal{E})$ For every $v \in T$

$$E \exp\{\mathcal{Y}(v)\} = 1.$$

The aim of this section is to establish a similar exponential bound for a supremum of
$\mathcal{Y}(v)$ over $v \in T$. A trivial corollary of the condition $(\mathcal{E})$ is that if the set $T$ is finite
with $N = \# T$, then

$$E \exp\Bigl\{ \sup_{v \in T} \mathcal{Y}(v) \Bigr\} \le N.$$

Unfortunately, in the general case the supremum of $\mathcal{Y}(v)$ over $v$ does not necessarily
fulfill the condition of bounded exponential moments. We therefore consider a penalized
version of the process $\mathcal{Y}(v)$, that is, we try to bound the exponential moment of $\mathcal{Y}(v) - \mathrm{pen}(v)$
for some penalty function $\mathrm{pen}(v)$. The goal is to find a possibly minimal such
function $\mathrm{pen}(v)$ which provides

$$E \exp\Bigl\{ \sup_{v \in T} \bigl[ \mathcal{Y}(v) - \mathrm{pen}(v) \bigr] \Bigr\} \le 1.$$

In the case of a finite set $T$, a natural candidate is $\mathrm{pen}(v) = \log(\# T)$. Below we show
how this simple choice can be extended to the case of a general set $T$. There exist a
number of results about a supremum of a centered random field which are heavily based
on the theory of empirical processes. See e.g. the monographs van der Vaart and Wellner
(1996), Van de Geer (2000), Massart (2007), and references therein. Our approach is a
bit different. First, the process $\mathcal{Y}(v)$ does not need to be centered; instead we use the
normalization $E \exp\{\mathcal{Y}(v)\} = 1$. Secondly, we do not assume any particular structure of
this process like independence of observations, so the methods of empirical processes
do not apply here. Finally, our analysis focuses on the penalty function $\mathrm{pen}(\cdot)$ rather
than on the deviation probability of $\max_v \mathcal{Y}(v)$.
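The finite-set case can be verified exactly on a small discrete probability space. The sketch below is an illustration under assumptions made here: a hypothetical random variable $\xi$ with five outcomes and the field $\mathcal{Y}(v) = s_v \xi - \log E e^{s_v \xi}$ indexed by four hypothetical slopes $s_v$, so that $E \exp\{\mathcal{Y}(v)\} = 1$ holds by construction.

```python
import math

def normalized_field(outcomes, probs, slopes):
    """Rows Y(v) over outcomes: slope*x - log E exp(slope*xi), hence E exp(Y(v)) = 1."""
    field = []
    for s in slopes:
        log_mgf = math.log(sum(p * math.exp(s * x) for x, p in zip(outcomes, probs)))
        field.append([s * x - log_mgf for x in outcomes])
    return field

def exp_moment_of_sup(field, probs, pen=0.0):
    """Exact value of E exp{ sup_v [Y(v) - pen] } for a constant penalty pen."""
    total = 0.0
    for k, p in enumerate(probs):
        total += p * math.exp(max(row[k] for row in field) - pen)
    return total

outcomes = [-2.0, -1.0, 0.0, 1.0, 2.0]   # hypothetical support of xi
probs = [0.1, 0.2, 0.4, 0.2, 0.1]
slopes = [-1.5, -0.5, 0.5, 1.5]          # index set T with N = 4 points

field = normalized_field(outcomes, probs, slopes)
unpenalized = exp_moment_of_sup(field, probs)
penalized = exp_moment_of_sup(field, probs, pen=math.log(len(slopes)))
print(unpenalized, penalized)
```

The unpenalized moment exceeds one but stays below $N$, and subtracting $\log N$ brings it below one, which is the behaviour the general penalty is meant to reproduce for infinite $T$.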

5.1 A local bound 

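Define $M(v) = -E\mathcal{Y}(v)$, $\zeta(v) = \mathcal{Y}(v) - E\mathcal{Y}(v)$, and denote $\zeta(v,v') = \zeta(v) - \zeta(v')$ for
$v, v' \in T$. We assume a nonnegative symmetric function $\mathfrak{D}(v,v')$ is given such that the
following condition is fulfilled:

$(\mathcal{E}\epsilon)$ There exist numbers $\epsilon > 0$ and $\lambda^* > 0$ such that for any $\lambda \le \lambda^*$

$$\sup_{v, v' \in T:\ \mathfrak{D}(v,v') \le \epsilon} \log E \exp\Bigl\{ 2\lambda \frac{\zeta(v,v')}{\mathfrak{D}(v,v')} \Bigr\} \le 2\lambda^2.$$

Let $\epsilon > 0$ be as in condition $(\mathcal{E}\epsilon)$. Define for any point $v^\circ \in T$ the "ball"

$$\mathcal{B}(\epsilon, v^\circ) = \{ v : \mathfrak{D}(v, v^\circ) \le \epsilon \}.$$

To state the result, we have to introduce the notion of local entropy. We say that a
discrete set $\mathcal{D}(\epsilon, C)$ is an $\epsilon$-net in $C \subseteq T$ if

$$C \subseteq \bigcup_{v \in \mathcal{D}(\epsilon, C)} \mathcal{B}(\epsilon, v). \qquad (5.1)$$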

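By $\mathbb{N}(\epsilon_0, \epsilon, v^\circ)$ for $\epsilon_0 \le \epsilon$ we denote the local covering number defined as the minimal
number of sets $\mathcal{B}(\epsilon_0, \cdot)$ required to cover $\mathcal{B}(\epsilon, v^\circ)$. With this covering number we
associate the local entropy

$$\mathbb{Q}(\epsilon, v^\circ) \stackrel{\mathrm{def}}{=} \sum_{k=1}^{\infty} 2^{-k} \log \mathbb{N}(2^{-k}\epsilon, \epsilon, v^\circ).$$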
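To get a feeling for the size of the local entropy, one can evaluate the series in the Euclidean case. The sketch below is an upper bound under an assumption made here, not a formula from the text: it uses the standard volumetric estimate that a Euclidean ball of radius $\epsilon$ in $\mathbb{R}^p$ can be covered by at most $(1 + 2/\delta)^p$ balls of radius $\delta\epsilon$, applied with $\delta = 2^{-k}$.

```python
import math

def euclid_local_entropy_bound(p, terms=200):
    """Upper bound sum_{k>=1} 2^{-k} * p * log(1 + 2^{k+1}) for the local entropy of a ball in R^p."""
    return sum(2.0 ** (-k) * p * math.log1p(2.0 ** (k + 1)) for k in range(1, terms + 1))

q1 = euclid_local_entropy_bound(1)
q3 = euclid_local_entropy_bound(3)
print(q1, q3)
```

The series converges quickly because the weights $2^{-k}$ dominate the linear growth of $\log \mathbb{N}$ in $k$, and the bound is proportional to the dimension $p$.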

Assume that $v^\circ \in T$ is fixed. The following result controls the supremum in $v$ of
the penalized process $\mathcal{Y}(v) - \mathrm{pen}(v)$ over the ball $\mathcal{B}(\epsilon, v^\circ)$.
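Theorem 5.1. Assume $(\mathcal{E})$ and $(\mathcal{E}\epsilon)$. For any $\varrho \in (0,1)$ with $\varrho\epsilon/(1 - \varrho) \le \lambda^*$ and any
$v^\circ \in T$

$$\log E \exp\Bigl\{ \sup_{v \in \mathcal{B}(\epsilon, v^\circ)} \varrho \bigl[ \mathcal{Y}(v) - \mathrm{pen}(v) \bigr] \Bigr\} \le \frac{2\epsilon^2\varrho^2}{1 - \varrho} + (1 - \varrho) \mathbb{Q}(\epsilon, v^\circ) - \varrho \, \mathrm{pen}_\epsilon(v^\circ)$$

with

$$\mathrm{pen}_\epsilon(v^\circ) = \inf_{v \in \mathcal{B}(\epsilon, v^\circ)} \mathrm{pen}(v).$$

Proof. We begin with a result which bounds the stochastic component of the process
$\mathcal{Y}(v)$ within the local ball $\mathcal{B}(\epsilon, v^\circ)$.

Lemma 5.2. Assume that $\zeta(v)$ is a separable process satisfying condition $(\mathcal{E}\epsilon)$. Then
for any given $v^\circ \in T$, $v^\sharp \in \mathcal{B}(\epsilon, v^\circ)$, and $\lambda \le \lambda^*$

$$\log E \exp\Bigl\{ \frac{\lambda}{\epsilon} \sup_{v \in \mathcal{B}(\epsilon, v^\circ)} \zeta(v, v^\sharp) \Bigr\} \le \mathbb{Q}(\epsilon, v^\circ) + 2\lambda^2.$$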

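Proof. The proof is based on the standard chaining argument; see e.g. van der Vaart
and Wellner (1996). Without loss of generality, we assume that $\mathbb{Q}(\epsilon, v^\circ) < \infty$. Then for
any integer $k \ge 0$, there exists a $2^{-k}\epsilon$-net $\mathcal{D}_k(\epsilon, v^\circ)$ in the local ball $\mathcal{B}(\epsilon, v^\circ)$ having
the cardinality $\mathbb{N}(2^{-k}\epsilon, \epsilon, v^\circ)$. Using the nets $\mathcal{D}_k(\epsilon, v^\circ)$ with $k = 1, \ldots, K - 1$, one can
construct a chain connecting an arbitrary point $v$ in $\mathcal{D}_K(\epsilon, v^\circ)$ and $v^\sharp$. It means that
one can find points $v_k \in \mathcal{D}_k(\epsilon, v^\circ)$, $k = 1, \ldots, K - 1$, such that $\mathfrak{D}(v_k, v_{k-1}) \le 2^{-k+1}\epsilon$
for $k = 1, \ldots, K$. Here $v_K$ means $v$ and $v_0$ means $v^\sharp$. Notice that $v_k$ can be
constructed recurrently: $v_{k-1} = \tau_{k-1}(v_k)$, $k = K, \ldots, 1$, where

$$\tau_{k-1}(v) = \operatorname*{argmin}_{v' \in \mathcal{D}_{k-1}(\epsilon, v^\circ)} \mathfrak{D}(v, v').$$

It obviously holds

$$\zeta(v, v^\sharp) = \sum_{k=1}^{K} \zeta(v_k, v_{k-1}).$$

It holds for $\xi(v_k, v_{k-1}) = \zeta(v_k, v_{k-1})/\mathfrak{D}(v_k, v_{k-1})$ that

$$\zeta(v, v^\sharp) = \sum_{k=1}^{K} \mathfrak{D}(v_k, v_{k-1}) \, \xi(v_k, v_{k-1}) = 2\epsilon \sum_{k=1}^{K} c_k \, \xi(v_k, v_{k-1})$$

with $c_k = \mathfrak{D}(v_k, v_{k-1})/(2\epsilon) \le 2^{-k}$. By condition $(\mathcal{E}\epsilon)$, $\log E \exp\{ 2\lambda \xi(v_k, v_{k-1}) \} \le 2\lambda^2$.
Next,

$$\sup_{v \in \mathcal{D}_K(\epsilon, v^\circ)} \zeta(v, v^\sharp) \le \sum_{k=1}^{K} \sup_{v' \in \mathcal{D}_k(\epsilon, v^\circ)} \zeta(v', \tau_{k-1}(v')) \le 2\epsilon \sum_{k=1}^{K} \sup_{v' \in \mathcal{D}_k(\epsilon, v^\circ)} c_k \, \xi(v', \tau_{k-1}(v')). \qquad (5.2)$$
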
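Since $2^k c_k \le 1$ and $\sum_{k=1}^{K} 2^{-k} < 1$, the Hölder inequality and condition $(\mathcal{E}\epsilon)$ imply

$$\log E \exp\Bigl\{ \frac{\lambda}{\epsilon} \sup_{v \in \mathcal{D}_K(\epsilon, v^\circ)} \zeta(v, v^\sharp) \Bigr\} \le \log E \exp\Bigl\{ 2\lambda \sum_{k=1}^{K} \sup_{v' \in \mathcal{D}_k(\epsilon, v^\circ)} c_k \, \xi(v', \tau_{k-1}(v')) \Bigr\}$$

$$\le \sum_{k=1}^{K} 2^{-k} \log E \exp\Bigl\{ \sup_{v' \in \mathcal{D}_k(\epsilon, v^\circ)} 2^k c_k \times 2\lambda \, \xi(v', \tau_{k-1}(v')) \Bigr\}$$

$$\le \sum_{k=1}^{K} 2^{-k} \log \sum_{v' \in \mathcal{D}_k(\epsilon, v^\circ)} E \exp\bigl\{ 2^k c_k \times 2\lambda \, \xi(v', \tau_{k-1}(v')) \bigr\}$$

$$\le \sum_{k=1}^{K} 2^{-k} \bigl\{ \log \mathbb{N}(2^{-k}\epsilon, \epsilon, v^\circ) + 2\lambda^2 \bigr\}.$$

These inequalities and the separability of $\zeta(v, v^\sharp)$ yield

$$\log E \exp\Bigl\{ \frac{\lambda}{\epsilon} \sup_{v \in \mathcal{B}(\epsilon, v^\circ)} \zeta(v, v^\sharp) \Bigr\} = \lim_{K \to \infty} \log E \exp\Bigl\{ \frac{\lambda}{\epsilon} \sup_{v \in \mathcal{D}_K(\epsilon, v^\circ)} \zeta(v, v^\sharp) \Bigr\}$$

$$\le \sum_{k=1}^{\infty} 2^{-k} \bigl\{ 2\lambda^2 + \log \mathbb{N}(2^{-k}\epsilon, \epsilon, v^\circ) \bigr\} \le 2\lambda^2 + \mathbb{Q}(\epsilon, v^\circ),$$

which completes the proof of the lemma. □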



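Now define for a fixed point $v^\circ$

$$v^\sharp = \operatorname*{argmin}_{v \in \mathcal{B}(\epsilon, v^\circ)} \{ M(v) + \mathrm{pen}(v) \},$$

where $M(v) = -E\mathcal{Y}(v)$. If there are many such points, then take any of them as $v^\sharp$.
Obviously

$$\sup_{v \in \mathcal{B}(\epsilon, v^\circ)} \{ \mathcal{Y}(v) - \mathrm{pen}(v) \} \le \mathcal{Y}(v^\sharp) - \mathrm{pen}(v^\sharp) + \sup_{v \in \mathcal{B}(\epsilon, v^\circ)} \zeta(v, v^\sharp).$$

Therefore, by the Hölder inequality and Lemma 5.2 with $\lambda = \epsilon\varrho/(1 - \varrho)$

$$\log E \exp\Bigl\{ \sup_{v \in \mathcal{B}(\epsilon, v^\circ)} \varrho \bigl[ \mathcal{Y}(v) - \mathrm{pen}(v) \bigr] \Bigr\}$$

$$\le \varrho \log E \exp\{ \mathcal{Y}(v^\sharp) - \mathrm{pen}(v^\sharp) \} + (1 - \varrho) \log E \exp\Bigl\{ \frac{\varrho}{1 - \varrho} \sup_{v \in \mathcal{B}(\epsilon, v^\circ)} \zeta(v, v^\sharp) \Bigr\}$$

$$\le \frac{2\epsilon^2\varrho^2}{1 - \varrho} + (1 - \varrho) \mathbb{Q}(\epsilon, v^\circ) - \varrho \, \mathrm{pen}(v^\sharp)$$

$$\le \frac{2\epsilon^2\varrho^2}{1 - \varrho} + (1 - \varrho) \mathbb{Q}(\epsilon, v^\circ) - \varrho \, \mathrm{pen}_\epsilon(v^\circ),$$

which is the assertion of the theorem. □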
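The two elementary steps behind the last chain of inequalities can be written out explicitly. This is a worked sketch; the random variables $A$ and $B$ are shorthand introduced here.

```latex
% Hölder inequality with exponents 1/\varrho and 1/(1-\varrho), for \varrho \in (0,1):
E \exp\{ \varrho A + (1-\varrho) B \}
  \le \bigl( E e^{A} \bigr)^{\varrho} \bigl( E e^{B} \bigr)^{1-\varrho},
\qquad
A = \mathcal{Y}(v^\sharp) - \mathrm{pen}(v^\sharp), \quad
B = \frac{\varrho}{1-\varrho} \sup_{v \in \mathcal{B}(\epsilon, v^\circ)} \zeta(v, v^\sharp).

% The first factor: E e^{\mathcal{Y}(v^\sharp)} = 1 by (\mathcal{E}), so
\varrho \log E e^{A} = -\varrho\, \mathrm{pen}(v^\sharp).

% The second factor: Lemma 5.2 with \lambda/\epsilon = \varrho/(1-\varrho) gives
(1-\varrho) \log E e^{B}
  \le (1-\varrho) \bigl\{ 2\lambda^2 + \mathbb{Q}(\epsilon, v^\circ) \bigr\}
  = \frac{2\epsilon^2\varrho^2}{1-\varrho} + (1-\varrho)\, \mathbb{Q}(\epsilon, v^\circ),
\qquad \lambda = \frac{\epsilon\varrho}{1-\varrho}.
```

The requirement $\varrho\epsilon/(1 - \varrho) \le \lambda^*$ in Theorem 5.1 is exactly what makes this value of $\lambda$ admissible in Lemma 5.2.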
5.2 A global exponential bound for the penalized process

This section presents some sufficient conditions on the penalty function $\mathrm{pen}(v)$ which
ensure the general exponential bound for the penalized process $\mathcal{Y}(v) - \mathrm{pen}(v)$. For
simplicity we assume that the local entropy numbers $\mathbb{Q}(\epsilon, v)$ are uniformly bounded by
a constant $\mathbb{Q}^*(T)$. Let also $\pi$ be a $\sigma$-finite measure on the space $T$ and $\pi(A)$ stand
for the $\pi$-measure of a set $A \subseteq T$. The standard proposal for $\pi$ is the usual Lebesgue
measure.

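Theorem 5.3. Assume $(\mathcal{E})$ and $(\mathcal{E}\epsilon)$ with some fixed $\epsilon$ and $\lambda^*$. Let $\varrho < 1$ be such
that $\varrho\epsilon/(1 - \varrho) \le \lambda^*$. Let also $\mathbb{Q}(\epsilon, v) \le \mathbb{Q}^*(T)$ for all $v \in T$. Let a $\sigma$-finite measure
$\pi$ on $T$ be such that for some $\nu \ge 1$

$$\sup_{v \in \mathcal{B}(\epsilon, v^\circ)} \frac{\pi(\mathcal{B}(\epsilon, v))}{\pi(\mathcal{B}(\epsilon, v^\circ))} \le \nu. \qquad (5.3)$$

Finally, let a function $\mathrm{pen}(v)$ satisfy

$$\mathfrak{H}_\epsilon(\varrho) \stackrel{\mathrm{def}}{=} \log \int_T \frac{1}{\pi(\mathcal{B}(\epsilon, v^\circ))} \exp\{ -\varrho \, \mathrm{pen}_\epsilon(v^\circ) \} \, d\pi(v^\circ) < \infty$$

with $\mathrm{pen}_\epsilon(v^\circ) = \inf_{v \in \mathcal{B}(\epsilon, v^\circ)} \mathrm{pen}(v)$. Then

$$E \exp\Bigl\{ \sup_{v \in T} \varrho \bigl[ \mathcal{Y}(v) - \mathrm{pen}(v) \bigr] \Bigr\} \le \mathfrak{Q}(\varrho, \epsilon), \qquad (5.4)$$

where

$$\log \mathfrak{Q}(\varrho, \epsilon) = \frac{2\epsilon^2\varrho^2}{1 - \varrho} + (1 - \varrho) \mathbb{Q}^*(T) + \log \nu + \mathfrak{H}_\epsilon(\varrho). \qquad (5.5)$$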

Proof. We begin with a simple technical result which bounds the maximum of a given
function via the weighted integral of the local maxima.

Lemma 5.4. Let $f(v)$ be a nonnegative function on $T \subseteq \mathbb{R}^p$ and let for every point
$v \in T$ a vicinity $A(v)$ be fixed such that $v' \in A(v)$ implies $v \in A(v')$. Let also the
measure $\pi(A(v))$ of the set $A(v)$ fulfill for every $v^\circ \in T$

$$\sup_{v \in A(v^\circ)} \frac{\pi(A(v))}{\pi(A(v^\circ))} \le \nu. \qquad (5.6)$$

Then

$$\sup_{v \in T} f(v) \le \nu \int_T f^*(v) \frac{1}{\pi(A(v))} \, d\pi(v)$$

with

$$f^*(v) \stackrel{\mathrm{def}}{=} \sup_{v' \in A(v)} f(v').$$

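Proof. For every $v^\circ \in T$

$$f(v^\circ) = \frac{1}{\pi(A(v^\circ))} \int_{A(v^\circ)} f(v^\circ) \, d\pi(v) \le \frac{1}{\pi(A(v^\circ))} \int_{A(v^\circ)} f^*(v) \, d\pi(v)$$

because $v \in A(v^\circ)$ implies $v^\circ \in A(v)$ and hence $f(v^\circ) \le f^*(v)$. Now by (5.6)

$$f(v^\circ) \le \nu \int_{A(v^\circ)} f^*(v) \frac{1}{\pi(A(v))} \, d\pi(v) \le \nu \int_T f^*(v) \frac{1}{\pi(A(v))} \, d\pi(v)$$

as required. □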
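The statement of Lemma 5.4 can be checked on a discrete analogue. The sketch below assumes a setting chosen here for illustration: $T = \{0, \ldots, n-1\}$ with the counting measure in place of $\pi$, symmetric neighborhoods $A(v) = \{v' : |v - v'| \le r\} \cap T$, and an arbitrary nonnegative test function `f`.

```python
import math

def neighborhood(v, n, r):
    """A(v): grid points within distance r of v; symmetric, so v' in A(v) iff v in A(v')."""
    return range(max(0, v - r), min(n - 1, v + r) + 1)

def lemma_54_sides(f, r):
    """Return (sup f, nu * sum_v f*(v)/|A(v)|, nu) for the counting measure."""
    n = len(f)
    size = [len(neighborhood(v, n, r)) for v in range(n)]
    # nu as in (5.6): largest ratio |A(v)| / |A(v0)| over v in A(v0).
    nu = max(size[v] / size[v0] for v0 in range(n) for v in neighborhood(v0, n, r))
    f_star = [max(f[u] for u in neighborhood(v, n, r)) for v in range(n)]
    lhs = max(f)
    rhs = nu * sum(f_star[v] / size[v] for v in range(n))
    return lhs, rhs, nu

f = [math.exp(-(v - 7) ** 2 / 10.0) + 0.1 for v in range(30)]  # hypothetical test function
lhs, rhs, nu = lemma_54_sides(f, r=3)
print(lhs, rhs, nu)
```

Here $\nu > 1$ only because neighborhoods are truncated at the boundary of the grid; for neighborhoods of equal measure one gets $\nu = 1$.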

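This result applied to $f(v) = \exp\{ \varrho[\mathcal{Y}(v) - \mathrm{pen}(v)] \}$ and $A(v) = \mathcal{B}(\epsilon, v)$ implies

$$\sup_{v \in T} \exp\bigl\{ \varrho[\mathcal{Y}(v) - \mathrm{pen}(v)] \bigr\} \le \nu \int_T \sup_{v \in \mathcal{B}(\epsilon, v^\circ)} \exp\bigl\{ \varrho[\mathcal{Y}(v) - \mathrm{pen}(v)] \bigr\} \frac{d\pi(v^\circ)}{\pi(\mathcal{B}(\epsilon, v^\circ))}.$$

This implies by Theorem 5.1

$$\log E \sup_{v \in T} \exp\bigl\{ \varrho[\mathcal{Y}(v) - \mathrm{pen}(v)] \bigr\} \le \frac{2\epsilon^2\varrho^2}{1 - \varrho} + (1 - \varrho) \mathbb{Q}^*(T) + \log\Bigl\{ \nu \int_T \exp\{ -\varrho \, \mathrm{pen}_\epsilon(v^\circ) \} \frac{d\pi(v^\circ)}{\pi(\mathcal{B}(\epsilon, v^\circ))} \Bigr\}$$

$$\le \frac{2\epsilon^2\varrho^2}{1 - \varrho} + (1 - \varrho) \mathbb{Q}^*(T) + \log(\nu) + \mathfrak{H}_\epsilon(\varrho)$$

and the assertion follows. □
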
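5.3 Smooth case

Here we discuss the special case when $T \subseteq \mathbb{R}^p$, the process $\mathcal{Y}(v)$ and its stochastic
component $\zeta(v)$ are absolutely continuous and the gradient $\nabla\zeta(v) \stackrel{\mathrm{def}}{=} d\zeta(v)/dv$ has
bounded exponential moments. We also assume that $\pi$ is the Lebesgue measure on $T$.
Suppose the following condition is fulfilled:

$(\mathcal{E}D)$ There exist $\lambda^* > 0$ and for each $v \in T$ a symmetric non-negative matrix $H(v)$
such that for any $\lambda \le \lambda^*$

$$\sup_{v \in T} \sup_{\gamma \in S^p} \log E \exp\Bigl\{ 2\lambda \frac{\gamma^\top\nabla\zeta(v)}{\| H(v)\gamma \|} \Bigr\} \le 2\lambda^2.$$

The matrix function $H(v)$ can be used for defining a natural topology in $T$. Namely,
for any $v, v' \in T$ define $\mathfrak{d} = \| v - v' \|$, $\gamma = (v - v')/\mathfrak{d}$ and

$$\mathfrak{D}^2(v, v') \stackrel{\mathrm{def}}{=} \| v - v' \|^2 \int_0^1 \gamma^\top H^2(v' + t\mathfrak{d}\gamma) \gamma \, dt.$$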
Jo 
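Next, introduce for each $v^\circ \in T$ and $\epsilon > 0$ the set

$$\mathcal{B}(\epsilon, v^\circ) = \{ v : \mathfrak{D}(v, v^\circ) \le \epsilon \}.$$

To state the result, we need one more condition on the uniform continuity of the
matrix $H(v)$ in $v$.

$(H)$ There exist constants $\epsilon > 0$ and $\nu_1 \ge 1$ such that for every $v^\circ \in T$

$$\sup_{v \in \mathcal{B}(\epsilon, v^\circ)} \sup_{\gamma \in S^p} \frac{\gamma^\top H^2(v) \gamma}{\gamma^\top H^2(v^\circ) \gamma} \le \nu_1.$$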

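Theorem 5.5. Let $(\mathcal{E})$ be satisfied. Suppose that $(\mathcal{E}D)$ holds with some $\lambda^*$ and a matrix
function $H(v)$ which fulfills $(H)$. If for some $\varrho \in (0,1)$ and $\epsilon > 0$ with $\varrho\epsilon/(1 - \varrho) \le \lambda^*$,
the penalty function $\mathrm{pen}(v)$ fulfills

$$\mathfrak{H}_\epsilon(\varrho) \stackrel{\mathrm{def}}{=} \log\Bigl\{ \omega_p^{-1} \epsilon^{-p} \int_T \det(H(v^\circ)) \exp\{ -\varrho \, \mathrm{pen}_\epsilon(v^\circ) \} \, dv^\circ \Bigr\} < \infty$$

with $\mathrm{pen}_\epsilon(v^\circ) = \inf_{v \in \mathcal{B}(\epsilon, v^\circ)} \mathrm{pen}(v)$, then

$$E \exp\Bigl\{ \sup_{v \in T} \varrho \bigl[ \mathcal{Y}(v) - \mathrm{pen}(v) \bigr] \Bigr\} \le \mathfrak{Q}(\varrho, \epsilon), \qquad (5.7)$$

where

$$\log \mathfrak{Q}(\varrho, \epsilon) = \frac{2\epsilon^2\varrho^2}{1 - \varrho} + (1 - \varrho) \mathbb{Q}_p + \mathfrak{H}_\epsilon(\varrho) + p \log(\nu_1)$$

with $\mathbb{Q}_p$ being the usual entropy number for the Euclidean ball in $\mathbb{R}^p$.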

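Proof. First we show that the differentiability condition $(\mathcal{E}D)$ implies the local moment
condition $(\mathcal{E}\epsilon)$.

Lemma 5.6. Assume that $(\mathcal{E}D)$ holds with some $\lambda^*$. Then for any $v^\circ \in T$ and any
$\lambda$ with $|\lambda| \le \lambda^*/\nu_1^{1/2}$, it holds

$$\sup_{v \in \mathcal{B}(\epsilon, v^\circ)} \log E \exp\Bigl\{ 2\lambda \frac{\zeta(v, v^\circ)}{\mathfrak{D}(v, v^\circ)} \Bigr\} \le 2\lambda^2. \qquad (5.8)$$

Proof. For $v \in \mathcal{B}(\epsilon, v^\circ)$, denote $\mathfrak{d} = \| v - v^\circ \|$, $\gamma = (v - v^\circ)/\mathfrak{d}$. With this notation

$$\zeta(v, v^\circ) = \mathfrak{d} \, \gamma^\top \int_0^1 \nabla\zeta(v^\circ + t\mathfrak{d}\gamma) \, dt.$$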


The condition (H) implies for every t G [0, 1] that 






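Now the Hölder inequality and $(\mathcal{E}D)$ yield

$$\log E \exp\Bigl\{ 2\lambda \frac{\zeta(v, v^\circ)}{\mathfrak{D}(v, v^\circ)} \Bigr\} = \log E \exp\Bigl\{ \int_0^1 \frac{2\lambda \mathfrak{d} \, \| H(v^\circ + t\mathfrak{d}\gamma)\gamma \|}{\mathfrak{D}(v, v^\circ)} \cdot \frac{\gamma^\top\nabla\zeta(v^\circ + t\mathfrak{d}\gamma)}{\| H(v^\circ + t\mathfrak{d}\gamma)\gamma \|} \, dt \Bigr\}$$

$$\le \int_0^1 \log E \exp\Bigl\{ \frac{2\lambda \mathfrak{d} \, \| H(v^\circ + t\mathfrak{d}\gamma)\gamma \|}{\mathfrak{D}(v, v^\circ)} \cdot \frac{\gamma^\top\nabla\zeta(v^\circ + t\mathfrak{d}\gamma)}{\| H(v^\circ + t\mathfrak{d}\gamma)\gamma \|} \Bigr\} \, dt$$

$$\le \int_0^1 \frac{2\lambda^2 \mathfrak{d}^2 \, \gamma^\top H^2(v^\circ + t\mathfrak{d}\gamma) \gamma}{\mathfrak{D}^2(v, v^\circ)} \, dt = 2\lambda^2$$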

as required. 



□ 



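Next we show that condition $(H)$ implies (5.3). Consider for every $v^\circ \in T$ an elliptic
neighborhood $\mathcal{B}'(\epsilon, v^\circ) = \{ v : \| H(v^\circ)(v - v^\circ) \| \le \epsilon \}$.

Lemma 5.7. Assume $(H)$. Then

1. for any $\epsilon > 0$ and any $v \in T$

$$\mathcal{B}'(\nu_1^{-1/2}\epsilon, v) \subseteq \mathcal{B}(\epsilon, v) \subseteq \mathcal{B}'(\nu_1^{1/2}\epsilon, v); \qquad (5.9)$$

2. for every $v \in T$,

$$\nu_1^{-p/2} \le \epsilon^{-p} \pi(\mathcal{B}(\epsilon, v)) \det(H(v)) / \omega_p \le \nu_1^{p/2}, \qquad (5.10)$$

where $\omega_p$ is the Lebesgue measure of the unit ball in $\mathbb{R}^p$.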
3. condition (5.3) holds with v = . 
Proof. Condition $(H)$ implies for any $v^\circ \in T$ and $v \in \mathcal{B}(\epsilon, v^\circ)$ that

$$\nu_1^{-1} \gamma^\top H^2(v^\circ) \gamma \le \int_0^1 \gamma^\top H^2(v^\circ + t\mathfrak{d}\gamma) \gamma \, dt \le \nu_1 \gamma^\top H^2(v^\circ) \gamma$$

with $\mathfrak{d} = \| v - v^\circ \|$ and $\gamma = (v - v^\circ)/\mathfrak{d}$, which yields the first assertion of the lemma.

The Lebesgue measure of the ellipsoid $\mathcal{B}'(\epsilon, v)$ is equal to $\omega_p \epsilon^p / \det(H(v))$. This
and (5.9) imply the second assertion. This, in turn, implies (5.3) in view of $(H)$. □

The next result claims that in the smooth case the local entropy number $\mathbb{Q}(\epsilon, v^\circ)$ is
similar to the usual Euclidean situation.

Lemma 5.8. Assume $(H)$. Then $\sup_{v \in T} \mathbb{Q}(\epsilon, v) \le \mathbb{Q}_p + p \log(\nu_1)$.

Proof. Fix any $v^\circ \in T$. Linear transformation with the matrix $H^{-1}(v^\circ)$ reduces the
situation to the case when $H(v^\circ) = I$ and $\mathcal{B}'(\epsilon_0, v^\circ)$ is a usual Euclidean ball for
any $\epsilon_0 \le \epsilon$. Moreover, by $(H)$, each elliptic set $\mathcal{B}'(\epsilon_0, v)$ for $v \in \mathcal{B}(\epsilon, v^\circ)$ is nearly
a Euclidean ball in the sense that the ratio of its largest and smallest axes (which is
the ratio of the largest and smallest eigenvalues of $H^{-1}(v^\circ) H^2(v) H^{-1}(v^\circ)$) is bounded
by $\nu_1$. Therefore, for any $\epsilon_0 \le \epsilon$, a Euclidean net $\mathcal{D}^{e}(\epsilon_0/\nu_1)$ with the step $\epsilon_0/\nu_1$
ensures a covering of $\mathcal{B}(\epsilon, v^\circ)$ by the sets $\mathcal{B}(\epsilon_0, v')$, $v' \in \mathcal{D}^{e}(\epsilon_0/\nu_1)$. Therefore, the
corresponding covering number is bounded by $(\nu_1 \epsilon/\epsilon_0)^p$, yielding the claimed bound for
the local entropy. □

Now the result of Theorem 5.5 is reduced to the statement of Theorem 5.3. □

Computing the penalty simplifies a lot when the matrix $H(v)$ is uniformly bounded
by a matrix $H^*$, or, equivalently, condition $(H)$ is fulfilled for $H(v) = H^*$. Then one
can define $\mathrm{pen}(v)$ as a function of the norm $\| H^*(v - v_0) \|$ for a fixed $v_0$.

Theorem 5.9. Assume additionally to the conditions of Theorem 5.5 that $H(v) \le H^*$
for a symmetric matrix $H^*$. Suppose that $\kappa(t)$ is a monotonically decreasing positive
function on $[0, +\infty)$ satisfying

=^ cj-i f x{\\u\\)du =p r >€{t)tP-^dt < oo. (5.11) 
J MP Jo 

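Define

$$\mathrm{pen}(v) = -\varrho^{-1} \log \kappa\bigl( \epsilon^{-1} \| H^*(v - v_0) \| + 1 \bigr).$$

Then

$$E \exp\Bigl\{ \sup_{v \in T} \varrho \bigl[ \mathcal{Y}(v) - \mathrm{pen}(v) \bigr] \Bigr\} \le \mathfrak{Q}(\varrho, \epsilon) \qquad (5.12)$$

with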

log Q{Q,e) = ^ + (l-^)Q +log(r), 
where $\omega_p$ is the volume of the unit ball in $\mathbb{R}^p$.

Proof. Let us fix v° . Definition of the semi-metric 2) and condition {H) imply for 
every v G 'B{e,v°) that 

\\H*{v° -v)\\ < e. 
The triangle inequality and {H) now imply for this v that 

e-'\\H*{'v-'VQ)\\ + l>e-^\\H*{'v°-'v^)\\ 
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and pen^(^)°) > —q "^logx(e ^\\H*{v° — vq)\\) ■ Therefore, it follows by change of 
variables u = eH*{v — vq) that 



e-P I det(i7*)exp{-£-pen,(-u)}(i'0 < uj^'^ [ >i:{\\u\\)du 

POO 

< p / x{t)tP-^dt = 
JO 

and the result follows from Theorem 5.5. □ 
Natural candidates for the function x(-) and the corresponding -values are: 

Mt) = \\l + t\\-P-^^ = P/S2, 

where 61,62 > are some constants. The result of Theorem 5.9 yields 
Corollary 5.10. Under conditions of Theorem 5.9, the hound (5.12) holds with 



Y>eni{v) = Q ^5ie '^\\H*{v — vq) 



|2 



\og£l2{Q,e) = ^ + (l-^,)Qp + log(l + .;;VAir/')- 
I — Q 

peni(t;) = -g-\p + 62)\og{e-^\\H* {v - v^)\\ + 2) , 
\ogQi{Q,e) = ^ + (l-^,)Qp + log(p/52), 
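
The value $p/\delta_2$ quoted for the polynomial candidate can be compared with the exact value of the integral $p \int_0^\infty \kappa(t) t^{p-1} dt$ from (5.11). The sketch below assumes $\kappa(t) = (1 + t)^{-p-\delta}$ with a generic $\delta > 0$ in place of $\delta_2$ and uses the Beta-integral identity $\int_0^\infty t^{p-1} (1 + t)^{-(p+\delta)} dt = B(p, \delta)$; the function name is hypothetical.

```python
import math

def poly_kernel_integral(p, delta):
    """p * int_0^inf (1+t)^(-p-delta) * t^(p-1) dt = p * B(p, delta), computed via log-gamma."""
    log_beta = math.lgamma(p) + math.lgamma(delta) - math.lgamma(p + delta)
    return p * math.exp(log_beta)

for p in (1, 2, 5):
    for delta in (0.5, 1.0, 2.0):
        print(p, delta, poly_kernel_integral(p, delta), p / delta)
```

For $p = 1$ the two values coincide, while for $p > 1$ the exact integral is smaller, because $t^{p-1} \le (1 + t)^{p-1}$; in that case $p/\delta$ acts as an upper bound.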

Sometimes it is useful to combine the functions xi(-) and X2(-) in the form 

xit) = >ci{t)l{t >r) + X2{t)l{t < r) (5.13) 

for a properly selected r which still ensures (5.11) with 
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